AD-A152  605 
UNCLASSIFIED 


TOWARDS A STATISTICAL ANALYSIS OF GENETIC SEQUENCES
DATA WITH PARTICULAR (U) MASSACHUSETTS INST OF TECH
CAMBRIDGE STATISTICS CENTER S P ARSENIS MAR 85
TR-36-ONR N00014-74-C-0555 F/G 6/3




AD-A153  605 




STATISTICS  CENTER 

Massachusetts  Institute  of  Technology 


77  Massachusetts  Avenue  Rm.  E40-111.  Cambridge.  Massachusetts  02139  (617)253-8722 


TOWARDS  A  STATISTICAL  ANALYSIS  OF  GENETIC  SEQUENCES  DATA 
WITH  PARTICULAR  REFERENCE  TO  PROTEIN  SEQUENCES 

BY 


SPYROS  P.  ARSENIS 

MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 



TECHNICAL  REPORT  NO.  ONR  36 
MARCH  1985 

PREPARED  UNDER  CONTRACT 
N00014-74-C-0555 (NR-609-001)

FOR  THE  OFFICE  OF  NAVAL  RESEARCH 

Reproduction  in  whole  or  in  part  is  permitted  for 
any  purpose  of  the  United  States  Government 

This  document  has  been  approved  for  public  release 
and  sale;  its  distribution  is  unlimited 




TOWARDS  A  STATISTICAL  ANALYSIS  OF  GENETIC  SEQUENCES  DATA 
WITH  PARTICULAR  REFERENCE  TO  PROTEIN  SEQUENCES. 


by 


SPYROS P. ARSENIS


ABSTRACT 


This  report  develops  a  variety  of  character  matrices  as  graphical  tools 
for  the  visual  examination  of  genetic  sequences  and  in  particular  protein 
sequences.  The  NNC,  PNC,  BNC1,  BNC2  and  BNC3  matrices  are  designed  to  filter 
noise  without  severely  suppressing  signals  in  the  CC  matrix.  The  Matrix  Smear 
of  a  character  matrix  is  introduced  as  a  measure  of  signals  and  noise  in  the 
matrix. The asymptotic distributions of the smears of the CC and NNC matrices
are derived under the independence model. The asymptotic result is used in
conjunction with exact confidence intervals from diagonal smears to automate
partially the visual examination of character matrices. A generalized likelihood
ratio procedure is developed to automate fully the detection of signals
in two protein sequences. A simulation study shows the procedure to be
powerful  and  robust  in  detecting  signals  of  success  probability  .90  and 
length  9  implanted  within  noisy  binary  strings  of  length  291  characters  and 
success  probability  .15. 


Some  Key  Words:  Genetic  sequences,  DNA,  Matrix  Smear,  Character  Matrix  Graphics 
AMS  1980  subject  classification.  Primary  62P10 




TABLE OF CONTENTS

1. Introduction, biological background and nomenclature
2. Character matrices as exploratory tools for genetic sequences data
3. Statistical properties of smears of character matrices
4. Smears along diagonals of character matrices
5. Automated detection of signals within two words
Appendix 1
Bibliography


1.  INTRODUCTION,  BIOLOGICAL  BACKGROUND  AND  NOMENCLATURE 


The subject of this research is the development of a statistical
methodology to analyze protein and DNA sequence data. Various data
analytic tools are presented here; their development was motivated by the
examination of fourteen DNA sequences which encode proteins forming the
eggshell of the American silkmoth Antheraea polyphemus. The genes were
sequenced in the laboratory of Professor Fotis Kafatos.

The question initially posed by Fotis Kafatos was to
cluster the fourteen genes on the basis of their similarities within
regions where similarities had already been detected. A measure of
similarity between strings was developed, and its application to the
regions where the genes had been detected to be similar produced
clusters that made good biological sense.

To  find  out  if  there  were  other  regions  where  the  fourteen  genes 
were  similar,  graphical  ways  to  represent  the  data  were  required.  Through 
these  it  became  clear  that  the  genes  shared  similarities  far  more 
extensive  than  previously  detected  and  that  there  was  a  lot  of  structure 
within  each  gene,  basically  in  the  form  of  consecutive  repeats  of  a  basic 
repeat  unit. 

Chapter 2 presents a variety of character matrices as graphical
tools to allow the investigator to look into string data. These matrices
are designed so as to reduce the matrix smear, which is a measure of
'signals' and 'noise' in the data, without suppressing 'signals'.
Chapter 3 presents an asymptotic result for the distribution of the smear
of some of the matrices of chapter 2, under the assumption that strings
are written independently between and within themselves. Chapter 4
compares the matrix smear to the diagonal smears to 'automate' the visual
examination of character matrices. Chapter 5 develops a machine
examination of character matrices by listing the significant substrings
of the words which maximize a generalized log-likelihood ratio for the
hypothesis that, for two parameters $p_0$ and $p_1$ with $p_0 < p_1$, the probability of a
match is smaller than $p_0$ vs. the alternative hypothesis that it is larger
than $p_1$. The remainder of chapter 1 presents the biological background necessary to
pose questions relating to genetic sequence data and concludes with the
presentation of the chorion data set. The compendium is based on Dayhoff
[6], Hood [8], Mahan [10], and Watson [15].

Observed  via  a  microscope,  chromosomes  are  paired  threadlike 
structures  in  the  nuclei  of  living  cells.  Since  the  beginning  of  this 
century,  chromosomes  were  recognized  to  be  responsible  for  the 
transmission  of  the  hereditary  properties  of  organisms  via  their 
subunits,  called  genes.  As  little  had  been  known  about  their  structure  at 
the  molecular  level,  however,  genes  were  considered  as  black  boxes  until 
rather  recently. 

A chromosome is a giant DNA molecule. Proposed by Watson and Crick
in  1953,  the  structure  of  DNA  is  that  of  two  intertwined  strands  giving 
the  molecule  the  shape  of  a  double  helix  as  illustrated  in  figure  1-1. 

The backbone of each strand is provided by the sugar molecule
deoxyribose. The structural formula of deoxyribose is shown in figure 1-2.
On one apex of the pentagonal ring stands an oxygen (O) atom, the
other four being occupied by carbon (C) atoms. On the deoxyribose
molecule there are five C atoms indexed by the integers 1, 2, 3, 4, and 5.

The structural formulae of the five bases are shown in figure 1-3.
To the 3 and 5 C atom sites of deoxyribose are attached phosphate groups
(PO4) that provide the links between successive sugar molecules in
the DNA strand as illustrated in figure 1-4.

Figure 1-4. The structure of a strand of a DNA molecule.


The combination of the deoxyribose molecule with one of the bases
and the phosphate group is called a nucleotide. The phosphate and the
deoxyribose always being the same, nucleotides are denoted by the base
molecules T, C, A, G, or U which are attached to the deoxyribose.

The helical structure of DNA is made possible by bonds among bases
in opposite strands. In particular, thymines bind to adenines and
guanines to cytosines ('base pairing rules'). Consequently, DNA may be
represented by the sequence of nucleotides in one strand, together with the
direction in which that sequence is read. The convention established in
the biochemical literature is that a sequence of letters from the
alphabet of T, C, A, G represents the nucleotides from the chain end on
the 5 C atom of deoxyribose to that on the 3 C site. With this
convention, DNA sequence data will be considered as words written in the
alphabet of the four bases (T, C, A, G). They will be denoted as finite
sequences $X = (X_1, \ldots, X_n)$, for $X_i \in \{T, C, A, G\}$.

At the molecular level, a gene is a piece of the DNA molecule
usually several hundred base letters long. A gene encodes and, under
certain conditions, directs the synthesis of a protein as is sketched
later on in this section. The protein coding portion of a gene starts
with the letters ATG and ends with one of TAA, TAG, or TGA.

Proteins  are  molecules  found  throughout  living  organisms  acting  as 
enzymes  (catalyzing  various  biochemical  reactions)  or  forming  membranes 
of  cells  and  other  cellular  structures  (playing  a  structural  role).  The 
building blocks of proteins are the amino acids. Table 1-1 gives the
alphabet in which the twenty amino acids are conventionally abbreviated.

Table 1-1. 1-letter abbreviations for the twenty amino acids.

 1  Phenylalanine  F      11  Isoleucine     I
 2  Leucine        L      12  Methionine     M
 3  Serine         S      13  Threonine      T
 4  Tyrosine       Y      14  Asparagine     N
 5  Cysteine       C      15  Lysine         K
 6  Tryptophan     W      16  Valine         V
 7  Proline        P      17  Alanine        A
 8  Histidine      H      18  Aspartic acid  D
 9  Glutamine      Q      19  Glutamic acid  E
10  Arginine       R      20  Glycine        G

For our purposes, and in the absence of other information about
their structure, proteins are words written in the alphabet of the twenty
letters of table 1-1 and denoted as finite sequences $X = (X_1, \ldots, X_n)$,
for all $X_i$ in the alphabet of the twenty letters. A protein sequence is
written in the direction in which its encoding DNA sequence is
conventionally written, each amino acid encoded by three consecutive
nucleotides as will be explained below. Proteins and DNA sequences will
be interchangeably referred to as words or strings; stretches of the
above will be referred to as syllables or substrings.

The  synthesis  of  a  protein  is  directed  by  its  corresponding  gene 
through  the  following  two  step  mechanism: 

(1) Transcription of DNA to mRNA. One of the two strands of
the DNA molecule acts as a template which appropriate enzymes copy into
RNA, a chemically similar molecule. RNA is a single stranded molecule
built up of nucleotides bound to each other as in DNA. The bases in the
RNA nucleotides are A, G, C, and U. They are respectively copied from the
T, C, G and A bases of the DNA strand under transcription. The
transcribed RNA strand subsequently undergoes 'splicing'. In particular,
regions of the RNA strand, called 'introns' for intervening, are removed
and the remaining regions, called 'exons', are joined together to form
the messenger RNA (mRNA).

(2) Translation of the mRNA to the protein. The mRNA acts as a
template which, in conjunction with other components of the cell
(ribosomes, tRNA, etc.), directs the assembly of a string of corresponding
amino acids as specified by the genetic code.

The genetic code is shown in table 1-2. It maps each triplet of
consecutive nucleotides, called a codon, to an amino acid except for
codons UAA, UAG, UGA. The latter codons mark the end of the protein
coding region of the gene and are called terminator codons. Codon AUG is
used as an initiator or for encoding methionine internal to the protein
chain. Since 61 codons are mapped into 20 amino acids, some amino acids are
necessarily encoded by more than one codon.
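
The two-step mechanism and the codon mapping described above can be sketched in a few lines of Python. This is only an illustration, not part of the original report: the names `CODON_TABLE`, `transcribe` and `translate` are hypothetical, transcription is simplified to rewriting the conventionally written DNA strand with U in place of T, splicing is ignored, and the reading frame is assumed to start at the first letter.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G
# (first base varies slowest). '*' marks the three terminator codons.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Simplified transcription: write the coding strand with U instead of T."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA word codon by codon, stopping at a terminator codon."""
    protein = []
    for k in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[k:k + 3]]
        if amino == "*":          # UAA, UAG or UGA: end of the coding region
            break
        protein.append(amino)
    return "".join(protein)


if __name__ == "__main__":
    print(translate(transcribe("ATGTTTGGATAA")))
```

Counting the entries of `CODON_TABLE` that are not `'*'` gives the 61 amino-acid codons mentioned in the text.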

Table 1-2. The genetic code, with codons entered in a three way
table.

UUU  F       UCU  S       UAU  Y       UGU  C
UUC  F       UCC  S       UAC  Y       UGC  C
UUA  L       UCA  S       UAA  Term    UGA  Term
UUG  L       UCG  S       UAG  Term    UGG  W

CUU  L       CCU  P       CAU  H       CGU  R
CUC  L       CCC  P       CAC  H       CGC  R
CUA  L       CCA  P       CAA  Q       CGA  R
CUG  L       CCG  P       CAG  Q       CGG  R

AUU  I       ACU  T       AAU  N       AGU  S
AUC  I       ACC  T       AAC  N       AGC  S
AUA  I       ACA  T       AAA  K       AGA  R
AUG  M       ACG  T       AAG  K       AGG  R

GUU  V       GCU  A       GAU  D       GGU  G
GUC  V       GCC  A       GAC  D       GGC  G
GUA  V       GCA  A       GAA  E       GGA  G
GUG  V       GCG  A       GAG  E       GGG  G

Supported by fossil and biochemical evidence, the fundamental
evolutionary scenario of biology postulates that billions of years ago
life on earth existed in a simple ancestral form. While organisms
evolved from their common ancestor, numerous mutations accumulated on
their genetic material. Mutations occur in individual organisms by
chance; over time they may spread through or disappear from the
population. Their laws are studied in evolutionary biology and population
genetics and are not directly relevant in the present discussion.

The fundamental scenario adapts to the biochemical level of
description of organisms as follows: living organisms undergo mutations
on their genetic material. Mutations of two kinds have been observed. A
base may substitute for another in a DNA strand and give rise to a point
mutation; fractions of a gene, whole genes, or microscopically visible
pieces of chromosomes may duplicate, become deleted, or translocate.
Segmental mutations refer to the above events occurring on fractions of
genes. Mutations are said to be selectively deleterious to the
individual organism on which they are imposed if they increase the
likelihood that the individual organism dies or leaves fewer
descendants. Other mutations may offer the organism selective advantages,
or may be selectively neutral. Gravely deleterious mutations are censored
by natural selection; selectively neutral or even slightly deleterious
mutations may survive or even become fixed in the population by chance.

Figure 1-5 presents the coding portions of the fourteen genes
under analysis. The genes are given as 292, 292A, 292B, 609, 18, 18B,
18C, 401, 401A, 401B, 408, 10, 10A, and 10B. On the basis of their
extensive similarities, genes 292, 292A, and 292B are collectively
called 292 copies. (Similarly for 18, 18B, and 18C, etc.) The first seven
genes will be referred to as family A. Family B comprises the last seven
genes.

On the basis of when their protein products are formed during the
formation of the eggshell, 292 copies and gene 609 form the middle A
subfamily; the copies of 18 form the late A subfamily. The late B
subfamily is made of the copies of 401 while gene 408 and the copies of
10 form the middle B subfamily.


Figure 1-5. The sequences for the coding regions of the chorion
genes of families A and B.


2.  CHARACTER  MATRICES  AS  EXPLORATORY  TOOLS  FOR  GENETIC  SEQUENCES  DATA 


The  proteins  encoded  by  the  chorion  genes  under  analysis  are  listed 
in figure 2-1. Human vision is inadequate to detect structure within or
similarities  between  the  proteins  as  they  are  presented.  This  chapter 
introduces  a  variety  of  character  matrices  which  proved  useful  in 
bringing  out  similarities  between  different  proteins  and  repeats  within 
proteins.  Character  matrices  have  been  constructed  for  DNA  and  amino 
acid  sequences  throughout  this  research.  In  this  chapter  they  are 
introduced  in  the  general  context  of  two  words  and  illustrated  for  some 
of  the  proteins  of  figure  2-1. 

Let $X = (X_1, \ldots, X_m)$ and $Y = (Y_1, \ldots, Y_n)$ be words written in the
alphabet $\{a_1, \ldots, a_s\}$. X and Y may or may not be the same. The Crude
Character (CC) matrix for X and Y is defined by:

$$M_{i,j} = \begin{cases} X_i & \text{if } X_i = Y_j \\ \text{(blank)} & \text{otherwise,} \end{cases} \qquad (2\text{-}1)$$

$i = 1, \ldots, m$ and $j = 1, \ldots, n$.
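
As a minimal sketch of this definition, the following Python builds the CC matrix as a list of rows. The function names `cc_matrix` and `render` are illustrative choices, not taken from the report, and indices are 0-based in the code while the text counts from 1.

```python
def cc_matrix(x, y, blank=" "):
    """Crude Character matrix: entry (i, j) is x[i] if x[i] == y[j], else blank."""
    return [[xi if xi == yj else blank for yj in y] for xi in x]


def render(matrix):
    """Join the rows into printable lines, one line per character of x."""
    return "\n".join("".join(row) for row in matrix)


if __name__ == "__main__":
    # A syllable shared by both words appears as a run along one diagonal.
    print(render(cc_matrix("GAVLGAV", "LGAVGG")))
```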

The  idea  of  using  two  dimensional  arrays  to  look  into  string  data 
appeared first, latently, in figure 1 of Needleman and Wunsch [11] in
their exposition of an algorithm to compute the longest common subsequence
between two words. CC matrices were also explicitly constructed in Gibbs
and McIntyre [7].

Character matrices are useful exploratory tools for looking into
sequence data because a substring common to the two words shows up as
a diagonal in the CC matrix for the words. Figure 2-2 presents the CC
matrix for the proteins encoded by genes 292 and 18B. Two major,
relatively solid diagonals can be distinguished on the CC matrix of
figure 2-2. The longest diagonal consists of entries
$(M_{75,63}, \ldots, M_{129,117})$ and indicates that syllables $(X_{75}, \ldots, X_{129})$ and
$(Y_{63}, \ldots, Y_{117})$ are similar in the sense that $X_{i+12} = Y_i$ for $i = 63, \ldots, 117$,
except for a few occasional mismatches. The structure of the matrix block
corresponding  to  (X^q,  . . .  ,X^)  and  (Y2<j,  . . .  ,Y^q)  will  become  clear  in 
matrices to be presented later. For the moment it is noted that the
longest diagonal in the block is $(M_{43,33}, \ldots, M_{59,49})$ and parallel to it
and within the block run other shorter diagonals. In a character matrix
for two words, the appearance of parallel diagonals at a substring of
one of the words signifies the existence of internal repeats in the
other word, as illustrated in figure 2-3.


Figure  2-3.  Parallel  diagonals  at  a  substring  of  X  are  due  to  and 
signify repeats of the substring in Y.


The CC matrix for X = Y brings out internal repeats within word X.
It is symmetric and its main-diagonal entries are nonblank. The CC matrix for
protein 292 is presented in figure 2-4. (It is not square only because
the characters of the line printer used are rectangular.) The diagonal string
$(M_{42,52}, \ldots, M_{51,61})$ marked on figure 2-4 runs parallel to the solid
diagonal of the matrix and is formed by the repeat of the syllable
$(X_{42}, \ldots, X_{51})$ as $(X_{52}, \ldots, X_{61})$ except for one mismatch.

The usefulness of CC matrices is limited by two factors: their size
and the 'noise' associated with them.

A common line printer can print up to 8 lines per inch vertically
and up to 132 characters per line horizontally. Therefore, when printed
on a line printer, the CC matrix for a word of 500 characters (a
length common for DNA sequence data) is longer than six feet. To
diminish the size of the matrices the investigator has to prepare
successive photoreductions at the expense of paper cutting and
paper pasting. This limitation may also be circumvented by presenting
character matrices on a plotter. A digital plotter applies a large grid
(for example of 4095 by 3124 sites) on a sheet of paper of desirable
dimensions. A character matrix in blanks and dots may then be plotted
by placing dots instead of alphabet characters at the appropriate grid
sites.

The second limitation of CC matrices is more serious. In
attempting to trace diagonals the human eye is distracted by characters
which are bound to appear only because of the composition of the
words. In particular, if the counts of alphabet characters $a_1, \ldots, a_s$ in
X and Y are $m_1, \ldots, m_s$ and $n_1, \ldots, n_s$ respectively, the CC matrix for
X and Y contains $\sum_{i=1}^{s} m_i n_i$ nonblank characters. Hence, the ratio of
nonblank characters to all the characters in the matrix is:

$$S(X,Y) = \sum_{i=1}^{s} \frac{m_i}{m}\,\frac{n_i}{n}, \qquad (2\text{-}2)$$

where $m_i/m$ and $n_i/n$ are the relative frequencies of $a_i$ in the two
words. S(X,Y) in (2-2) will be called the matrix smear for the CC
matrix of X and Y.
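
The identity between the letter-count formula and the fraction of nonblank entries can be checked directly. The sketch below is illustrative only; `matrix_smear` and `nonblank_fraction` are hypothetical names, and the second function counts matches by brute force over all pairs of positions.

```python
from collections import Counter


def matrix_smear(x, y):
    """CC matrix smear from letter counts: sum_i (m_i / m) * (n_i / n)."""
    cx, cy = Counter(x), Counter(y)
    # Counter returns 0 for letters absent from y, so they contribute nothing.
    return sum(cx[a] * cy[a] for a in cx) / (len(x) * len(y))


def nonblank_fraction(x, y):
    """Fraction of matching (nonblank) cells, counted cell by cell."""
    hits = sum(1 for xi in x for yj in y if xi == yj)
    return hits / (len(x) * len(y))


if __name__ == "__main__":
    x, y = "GAVLGAVGG", "LGAVGC"
    print(matrix_smear(x, y), nonblank_fraction(x, y))
```

Applied to a single word, `matrix_smear(x, x)` gives the one-word smear, which for a word using each of s letters equally often equals 1/s.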

If X is independent of Y and $\{X_t\}$, $t = 1, \ldots, m$, and $\{Y_t\}$, $t = 1, \ldots, n$, are
i.i.d. with $\Pr\{X_t = a_i\} = p_i$ and $\Pr\{Y_t = a_i\} = q_i$ for all t and i, the matrix
smear is a sample estimator of the parameter

$$\sigma = \sum_{i=1}^{s} p_i q_i. \qquad (2\text{-}3)$$

$\sigma$ will be called the theoretical smear for the CC matrix of X and Y.
Under the above independence assumptions the theoretical smear is the
probability of a nonblank character in the matrix.
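
To illustrate the relation between the sample smear and its theoretical counterpart under the independence model, the sketch below draws two i.i.d. words and compares the observed fraction of matches with the sum of products of letter probabilities. The function names, the seed and the word lengths are arbitrary assumptions made for the example, not values from the report.

```python
import random


def theoretical_smear(p, q):
    """Probability of a match at one cell: sum_i p_i * q_i."""
    return sum(pi * qi for pi, qi in zip(p, q))


def simulated_smear(p, q, m, n, seed=0):
    """Sample smear of the CC matrix of two independently drawn i.i.d. words."""
    rng = random.Random(seed)
    letters = range(len(p))
    x = rng.choices(letters, weights=p, k=m)
    y = rng.choices(letters, weights=q, k=n)
    matches = sum(1 for xi in x for yj in y if xi == yj)
    return matches / (m * n)


if __name__ == "__main__":
    uniform20 = [1 / 20] * 20   # twenty equally likely amino-acid letters
    print(theoretical_smear(uniform20, uniform20))
    print(simulated_smear(uniform20, uniform20, 300, 300))
```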

The matrix smear of the CC matrix for two different words ranges
from 0 (for words with no alphabet character in common) to 1 (for
words written in one letter). The matrix smear for the CC matrix for the
word X is

$$S(X) = \sum_{i=1}^{s} \left(\frac{m_i}{m}\right)^2. \qquad (2\text{-}4)$$

S(X) is minimized when $m_1 = \cdots = m_s$. The minimum attained is the inverse
of the size of the alphabet in which X is written. Table 2-1a lists to the
second decimal digit the smears for all pairs of chorion proteins.

Table 2-1a. Smears of CC matrices for all pairs of chorion proteins.

      292   292A  292B  609   18    18B   18C   401
292   .12
292A  .12   .12
292B  .12   .12   .12
609   .12   .12   .12   .12
18    .13   .13   .13   .13   .15
18B   .13   .13   .13   .13   .15   .15
18C   .13   .13   .13   .13   .15   .15   .15
401   .13   .13   .13   .13   .15   .15   .15   .15
401A  .13   .13   .13   .13   .15   .15   .15   .15
401B  .13   .13   .13   .13   .15   .15   .15   .15


Smears for all pairs range from 12% to 15%. Matrix smears within
subfamilies are stable as can be seen from table 2-1b below.


Table 2-1b. Range of smears of CC matrices within subfamilies
of chorion proteins.

          Middle A  Late A    Late B    Middle B
Middle A  .12
Late A    .13       .15
Late B    .13       .15       .15
Middle B  .12       .13-.14   .13       .12

The matrix smear specifies the proportion of nonblank characters
appearing in a given matrix and can be thought of as a measure of
'signal' and 'noise' in the data. Is it possible to reduce the smear
without substantially suppressing diagonals in the matrix? Recall that the
(i,j)th entry of the CC matrix was defined by comparing $X_i$ to $Y_j$. Now
consider the Next Neighbour Considered (NNC) character matrix for X and
Y, defined as:

$$M_{i,j} = \begin{cases} X_i & \text{if } X_i = Y_j \text{ and } X_{i+1} = Y_{j+1} \\ \text{(blank)} & \text{otherwise,} \end{cases} \qquad (2\text{-}5)$$

$i = 1, \ldots, m-1$ and $j = 1, \ldots, n-1$.
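
A minimal Python sketch of the NNC definition follows. The name `nnc_matrix` is illustrative; the code uses 0-based indices, so its rows and columns run over the first m-1 and n-1 positions of the two words.

```python
def nnc_matrix(x, y, blank=" "):
    """Next Neighbour Considered matrix.

    Entry (i, j) keeps x[i] only when both the current letters and their
    right-hand neighbours agree; the result has (m-1) rows and (n-1) columns.
    """
    m, n = len(x), len(y)
    return [
        [x[i] if x[i] == y[j] and x[i + 1] == y[j + 1] else blank
         for j in range(n - 1)]
        for i in range(m - 1)
    ]


if __name__ == "__main__":
    for row in nnc_matrix("GAVLGAV", "LGAVGG"):
        print("".join(row))
```

Isolated single-letter matches, which make up much of the 'noise' in a CC matrix, disappear here because they have no matching right neighbour.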

Figure 2-5 presents the NNC matrix for proteins 292 and 18B. It
clarifies  the  extensive  repeat  structure  in  the  block  formed  by 
(X^q. . . . ,Xg^ )  and  (Xjg, . . . ,Xjq)  and  brings  out  the  features  that  292  and 
18B share in common. If the syllable $(a_i a_j)$ occurs $m_{ij}$ and $n_{ij}$ times in
X and Y, then

$$\sum_{i=1}^{s}\sum_{j=1}^{s} m_{ij} = m-1, \qquad \sum_{i=1}^{s}\sum_{j=1}^{s} n_{ij} = n-1,$$

and the ratio of nonblank entries of the NNC matrix to the total number
of matrix entries is:

$$S(X,Y) = \sum_{i=1}^{s}\sum_{j=1}^{s} \frac{m_{ij}}{m-1}\,\frac{n_{ij}}{n-1}. \qquad (2\text{-}6)$$

The ratio of equation (2-6) will be called the smear of the NNC
matrix of X and Y. The smear of the NNC matrix for one word becomes

$$S(X) = \sum_{i=1}^{s}\sum_{j=1}^{s} \left(\frac{m_{ij}}{m-1}\right)^2$$

and attains a minimum equal to the inverse of the square of the alphabet
size. Table 2-2 lists up to the second decimal digit the smears of the
NNC matrices for all chorion proteins.


Table 2-2. Smears of NNC matrices for all pairs of chorion proteins.

      292   292A  292B  609   18    18B   18C   401   401A  401B  408   10    10A   10B
292   .02
292A  .02   .02
292B  .02   .02   .02
609   .02   .02   .02   .02
18    .02   .02   .02   .02   .03
18B   .02   .02   .02   .02   .03   .03
18C   .02   .02   .02   .02   .03   .03   .03
401   .02   .02   .02   .02   .02   .02   .02   .03
401A  .02   .02   .02   .02   .02   .02   .02   .03   .03
401B  .02   .02   .02   .02   .02   .02   .02   .03   .03   .03
408   .02   .02   .02   .02   .02   .02   .02   .02   .02   .02   .02
10    .02   .01   .01   .01   .02   .02   .02   .02   .02   .02   .02   .02
10A   .01   .01   .01   .01   .02   .02   .02   .02   .02   .02   .02   .02   .02
10B   .02   .01   .01   .01   .02   .02   .02   .02   .02   .02   .02   .02   .02   .02

Smears of the NNC matrices range from 1% to 3%. Those for the NNC
matrices for the same protein vary between 2% and 3% compared to the
minimum .25%. For proteins 292 and 18B the smear of 13% for the CC matrix
is reduced to 2% for the NNC matrix.

NNC matrices eliminate a number of nonblank characters appearing
on CC matrices that only blur diagonal strings. On the other hand,
corresponding to two syllables that are identical except for one
mismatch, the CC matrix produces a diagonal that is broken at one point
while the NNC matrix breaks the diagonal at two entries. This suggests a
third character matrix for which the $M_{i,j}$ are defined after comparing the
3-letter syllables $(X_{i-1}, X_i, X_{i+1})$ and $(Y_{j-1}, Y_j, Y_{j+1})$ as follows:

$$M_{i,j} = \begin{cases} X_i & \text{if } X_i = Y_j \text{ and } (X_{i-1} = Y_{j-1} \text{ or } X_{i+1} = Y_{j+1}) \\ * & \text{if } X_{i-1} = Y_{j-1},\ X_i \neq Y_j,\ X_{i+1} = Y_{j+1} \\ \text{(blank)} & \text{otherwise,} \end{cases} \qquad (2\text{-}7)$$

$i = 2, \ldots, m-1$ and $j = 2, \ldots, n-1$.
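
The three-way rule of equation (2-7) can be sketched as follows. This is an illustrative reading of the definition, assuming the asterisk is placed where the two neighbours match but the centre letters differ; `bnc1_matrix` is a hypothetical name and indices are 0-based, so only interior positions of each word are visited.

```python
def bnc1_matrix(x, y, blank=" ", star="*"):
    """Both Neighbours Considered (first variant) for interior positions.

    - x[i] if the centre letters match and at least one neighbour pair matches;
    - star if both neighbour pairs match but the centre letters differ;
    - blank otherwise.
    """
    rows = []
    for i in range(1, len(x) - 1):
        row = []
        for j in range(1, len(y) - 1):
            left = x[i - 1] == y[j - 1]
            mid = x[i] == y[j]
            right = x[i + 1] == y[j + 1]
            if mid and (left or right):
                row.append(x[i])
            elif left and right:       # here mid is necessarily False
                row.append(star)
            else:
                row.append(blank)
        rows.append(row)
    return rows


if __name__ == "__main__":
    # One isolated mismatch (L vs K) no longer breaks the diagonal.
    for row in bnc1_matrix("GAVLGAV", "GAVKGAV"):
        print("".join(row))
```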

The matrix defined in equation (2-7) will be called Both
Neighbours Considered and abbreviated by BNC1, the index 1 appended to
the acronym BNC to distinguish it from other matrices defined by
comparing 3-letter syllables. Figure 2-6 presents the BNC1 matrix for
proteins 292 and 18B. The BNC1 matrix allows nonconsecutive
mismatches in similar strings without breaking their diagonal. Table 2-3
presents the smears of the BNC1 matrices for all chorion proteins.


Table 2-3. Smears of BNC1 matrices for all pairs of chorion proteins.

      292   292A  292B  609   18    18B   18C   401   401A  401B  408   10    10A   10B
292   .04
292A  .04   .04
292B  .04   .04   .04
609   .04   .04   .04   .05
18    .05   .05   .05   .05   .06
18B   .05   .05   .05   .05   .06   .06
18C   .05   .05   .05   .05   .06   .06   .06
401   .05   .05   .05   .05   .06   .06   .07   .07
401A  .05   .05   .05   .05   .06   .06   .07   .07   .07
401B  .05   .05   .05   .05   .06   .06   .07   .07   .07   .07
408   .05   .04   .04   .05   .06   .06   .06   .06   .06   .06   .06
10    .04   .04   .04   .04   .05   .05   .06   .06   .06   .06   .05   .05
10A   .04   .04   .04   .04   .05   .05   .06   .06   .06   .06   .05   .05   .05
10B   .04   .04   .04   .04   .05   .05   .05   .06   .06   .06   .05   .05   .05   .05

The table indicates that the smears of the BNC1 matrices for
chorion proteins range from 4% to 7%. The smear of the BNC1 matrix for
proteins 292 and 18B is calculated to be 5%, between that of the CC (13%)
and the NNC matrix (2%).


Another BNC matrix, called BNC2, is defined by

$$M_{i,j} = \begin{cases} X_i & \text{if } X_i = Y_j \text{ and } (X_{i-1} = Y_{j-1} \text{ or } X_{i+1} = Y_{j+1}) \\ \text{(blank)} & \text{otherwise,} \end{cases} \qquad (2\text{-}8)$$

$i = 2, \ldots, m-1$ and $j = 2, \ldots, n-1$.

The BNC2 differs from the BNC1 matrix in that it suppresses the * of
equation (2-7). For comparison purposes, the BNC2 matrix for proteins 292
and 18B is presented in figure 2-7.

Finally we define the BNC3 matrix as:

$$M_{i,j} = \begin{cases} X_i & \text{if } X_{i-1} = Y_{j-1},\ X_i = Y_j \text{ and } X_{i+1} = Y_{j+1} \\ \text{(blank)} & \text{otherwise,} \end{cases} \qquad (2\text{-}9)$$

$i = 2, \ldots, m-1$ and $j = 2, \ldots, n-1$.

The BNC3 matrix for proteins 292 and 18B is presented in figure 2-8. Table
2-4 lists the smears for the BNC3 matrices for all pairs of chorion
proteins up to the second decimal digit. Smears less than .01 are not
entered in the table.
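
The two stricter variants differ only in the condition under which a centre letter is kept, so they can share one routine. The sketch below is illustrative; `bnc_matrix` and its `kind` argument are hypothetical names, and indices are 0-based over interior positions.

```python
def bnc_matrix(x, y, kind="BNC2", blank=" "):
    """BNC2 or BNC3 matrix for interior positions of the words x and y.

    BNC2 keeps x[i] when the centre letters match and at least one neighbour
    pair matches; BNC3 requires the centre and both neighbour pairs to match.
    """
    if kind not in ("BNC2", "BNC3"):
        raise ValueError("kind must be 'BNC2' or 'BNC3'")
    rows = []
    for i in range(1, len(x) - 1):
        row = []
        for j in range(1, len(y) - 1):
            left = x[i - 1] == y[j - 1]
            mid = x[i] == y[j]
            right = x[i + 1] == y[j + 1]
            if kind == "BNC2":
                keep = mid and (left or right)
            else:
                keep = left and mid and right
            row.append(x[i] if keep else blank)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for kind in ("BNC2", "BNC3"):
        print(kind)
        for row in bnc_matrix("GAVLGAV", "GAVKGAV", kind):
            print("".join(row))
```

Running both variants on the same pair of words shows the trade-off the text describes: BNC3 removes more isolated matches but also erases entries next to a mismatch that BNC2 retains.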

Table 2-4. Smears of BNC3 matrices for all pairs of chorion proteins.

      292   292A  292B  609   18    18B   18C   401   401A  401B  408   10    10A   10B
292   .01
292A  .01   .01
292B  .01   .01   .01
609   .01   .01   .01   .01
18    .01   .01   .01   .01   .01
18B   .01   .01   .01   .01   .01   .02
18C   .01   .01   .01   .01   .01   .01   .02

Rows 401 to 10B contain blank entries (smears less than .01), so the
columns of the values entered in these rows are not recoverable here;
the values are listed in the order in which they appear.

401   .01   .01   .01   .01
401A  .01   .01   .01   .01   .01   .01
401B  .01   .01   .01   .01   .01   .01   .01
408   .01   .01   .01   .01   .01   .01   .01
10    .01   .01   .01   .01   .01   .01   .01   .01
10A   .01   .01   .01   .01   .01   .01
10B   .01   .01   .01   .01   .01   .01   .01   .01

As can be seen from table 2-4 the smears of the BNC3 matrices for chorion proteins range up to 2%. Those for the same protein vary from 1% to 2% compared to the minimum $1/20^3 = .0125\%$. The BNC3 matrix "filters" the data rather severely and suppresses diagonals that were discernible in less restrictive matrices presented previously.

The  entries  of  all  matrices  defined  so  far  are  blanks,  asterisks 
or  alphabet  characters.  It  is  clear  that  in  a  quantitative  assessment  of 
diagonals  the  types  of  matches  and  mismatches  should  be  taken  into 
consideration, matches of rare letters being more "significant" than
those between frequent letters. However, visual examinations of character
matrices  are  not  elaborate  enough  to  take  the  nature  of  matches  or 
mismatches  into  account.  In  whichever  matrix  is  available,  the 
investigator  is  searching  for  long  diagonals  with  a  large  number  of 
matching  nonblank  characters  relative  to  the  length  of  the  diagonal. 
Thus  for  purposes  of  visual  examination  a  matrix  entry  may  be  reduced 
to  a  blank  or  a  non-blank  character. 

The five types of character matrices introduced in this chapter are conceptually and mathematically related. The (i,j)th entry of the NNC matrix was defined after comparing $X_i$ to $Y_j$ and their next (right) neighbours $X_{i+1}$ and $Y_{j+1}$. Instead one might compare the previous (left) neighbours $X_{i-1}$ and $Y_{j-1}$ and construct the Previous Neighbour Considered (PNC) matrix. The superposition of the PNC on the NNC produces the BNC2 matrix.
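The relations among these matrices can be checked on a small example. The sketch below is illustrative only: it reduces every entry to blank/non-blank (True/False), treats a missing neighbour at the border as a mismatch, and uses two made-up words; none of the function names come from the report.

```python
def match(x, y, i, j):
    """True when positions i and j both exist and carry the same letter."""
    return 0 <= i < len(x) and 0 <= j < len(y) and x[i] == y[j]

def indicator_matrix(x, y, rule):
    """Blank/non-blank version of a character matrix for a given entry rule."""
    return [[rule(x, y, i, j) for j in range(len(y))] for i in range(len(x))]

def nnc(x, y, i, j):   # match at (i, j) and at the next (right) neighbours
    return match(x, y, i, j) and match(x, y, i + 1, j + 1)

def pnc(x, y, i, j):   # match at (i, j) and at the previous (left) neighbours
    return match(x, y, i, j) and match(x, y, i - 1, j - 1)

def bnc2(x, y, i, j):  # match at (i, j) and at either pair of neighbours
    return match(x, y, i, j) and (match(x, y, i - 1, j - 1) or match(x, y, i + 1, j + 1))

def bnc3(x, y, i, j):  # match at (i, j) and at both pairs of neighbours
    return match(x, y, i - 1, j - 1) and match(x, y, i, j) and match(x, y, i + 1, j + 1)

# Hypothetical words, chosen only to contain a few shared syllables.
x, y = "GCLGYGCL", "CLGYEGCL"
N, P = indicator_matrix(x, y, nnc), indicator_matrix(x, y, pnc)
B2, B3 = indicator_matrix(x, y, bnc2), indicator_matrix(x, y, bnc3)

# Superposition (cell-wise "or") of PNC and NNC, and their cell-wise "and".
superposed = [[a or b for a, b in zip(rn, rp)] for rn, rp in zip(N, P)]
common = [[a and b for a, b in zip(rn, rp)] for rn, rp in zip(N, P)]
print(superposed == B2, common == B3)
```

Under these conventions the superposition of the PNC and NNC indicators reproduces BNC2, and the cells that are non-blank in both reproduce BNC3.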

The design of various character matrices to reduce the smear and enable the investigator to discern existing diagonals was previously called "filtering" of the data. The term has not only a heuristic appeal; for the NNC, PNC, and BNC3 matrices it is used appropriately in a technical sense too. Indeed, we can consider these character matrices as CC types of matrices on the data after they are transformed appropriately. In particular, consider transforming the sequences $\{X_t\}$, $t=1,\ldots,m$ and $\{Y_t\}$, $t=1,\ldots,n$ as:

$$\tilde X_t = \begin{bmatrix} X_t \\ X_{t+1} \end{bmatrix},\ t=1,\ldots,m-1, \qquad \tilde Y_s = \begin{bmatrix} Y_s \\ Y_{s+1} \end{bmatrix},\ s=1,\ldots,n-1.$$

Then the NNC matrix can be thought of as a CC type of matrix on the transformed data. The transformations corresponding to the PNC and BNC3 matrices are

$$\tilde X_t = \begin{bmatrix} X_{t-1} \\ X_t \end{bmatrix},\ t=2,\ldots,m, \qquad \tilde Y_s = \begin{bmatrix} Y_{s-1} \\ Y_s \end{bmatrix},\ s=2,\ldots,n$$

and

$$\tilde X_t = \begin{bmatrix} X_{t-1} \\ X_t \\ X_{t+1} \end{bmatrix},\ t=2,\ldots,m-1, \qquad \tilde Y_s = \begin{bmatrix} Y_{s-1} \\ Y_s \\ Y_{s+1} \end{bmatrix},\ s=2,\ldots,n-1$$

respectively.
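As a rough illustration of this "CC matrix on transformed data" view, the following sketch forms the windows of neighbouring letters and compares them as single symbols. The helper names (`cc_matrix`, `windows`, `smear`) and the two example words are assumptions made for the illustration, and the smear is taken simply as the fraction of non-blank cells.

```python
def cc_matrix(x, y):
    """Blank/non-blank CC matrix: cell (i, j) is True when x[i] == y[j]."""
    return [[a == b for b in y] for a in x]

def smear(matrix):
    """Fraction of non-blank cells of a matrix."""
    cells = [c for row in matrix for c in row]
    return sum(cells) / len(cells)

def windows(seq, before, after):
    """Tuples (seq[t-before], ..., seq[t+after]) for every t where they all exist."""
    width = before + after + 1
    return [tuple(seq[t:t + width]) for t in range(len(seq) - width + 1)]

# Hypothetical words used only for illustration.
x, y = "GCLGYGCL", "CLGYEGCL"

nnc = cc_matrix(windows(x, 0, 1), windows(y, 0, 1))   # letter and next neighbour
pnc = cc_matrix(windows(x, 1, 0), windows(y, 1, 0))   # previous neighbour and letter
bnc3 = cc_matrix(windows(x, 1, 1), windows(y, 1, 1))  # both neighbours

# Direct definition of the NNC indicator, for comparison.
nnc_direct = [[x[i] == y[j] and x[i + 1] == y[j + 1] for j in range(len(y) - 1)]
              for i in range(len(x) - 1)]
print(nnc == nnc_direct, smear(cc_matrix(x, y)), smear(nnc), smear(bnc3))
```

In this sketch the NNC and PNC constructions yield the same array of cells; they differ only in which original position each row and column is attributed to (t for the NNC, t+1 for the PNC).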

The NNC, PNC, BNC1, BNC2 and BNC3 character matrices were designed in order to reduce the noise in the CC matrices and make signals easily discernible. Of those, the NNC, PNC and BNC3 matrices suppress signals as well. A syllable of length L present in common in X and Y gives rise to a diagonal string of length L-1 for the NNC and PNC matrices and L-2 for the BNC3 matrix. The BNC2 matrix does not suppress signals but does not allow for mismatches; when a substring is common to the two words except for a mismatch, the diagonal corresponding to the syllable carries a blank character at the site of the mismatch. While filtering noise, the BNC1 may be thought of as enhancing signals as it does not allow occasional nonconsecutive mismatches in a syllable which is otherwise shared by X and Y to break the diagonal corresponding to the syllable.


Fig. 2-1. Proteins encoded by the fourteen chorion genes under ...

Fig. 2-2. CC matrix for proteins 292 and 18B.


Fig. 2-5. NNC matrix for proteins 292 and 18B.


Fig. 2-7. BNC2 matrix for proteins 292 and 18B.


Fig. 2-8. BNC3 matrix for proteins 292 and 18B.


3. STATISTICAL PROPERTIES OF SMEARS OF CHARACTER MATRICES
The CC, NNC, PNC, BNC1, BNC2 and BNC3 character matrices were introduced in chapter 2 as graphical tools to explore string data. This chapter derives some of the statistical properties of the smears of the CC and NNC matrices. The statistical properties of matrix smears depend on the model under which words are written. The most tractable model to work with is the independence model. The independence model supposes that words $X=(X_1,\ldots,X_m)$ and $Y=(Y_1,\ldots,Y_n)$, written in an alphabet of s characters $(a_1,\ldots,a_s)$, are independent sets of independent observations distributed as:

$$\Pr(X_t = a_i) = p_i, \qquad i=1,\ldots,s \text{ and } t=1,\ldots,m$$

and

$$\Pr(Y_t = a_i) = q_i, \qquad i=1,\ldots,s \text{ and } t=1,\ldots,n.$$

Propositions 3-1a and 3-2a derive the first two moments and the asymptotic distributions of the smears of the CC and NNC matrices for two different words X and Y. Propositions 3-1b and 3-2b derive the same results for matrices of one word X.

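Proposition 3-1a. Let S(X,Y) be the smear of the CC matrix of X and Y defined in (2-2). Under the independence model,

$$E\,S(X,Y) = \sigma \tag{3-1}$$

where $\sigma$ is the theoretical smear defined in equation (2-3),

$$\mathrm{Var}\,S(X,Y) = \frac{\sigma}{mn} - \frac{m+n-1}{mn}\,\sigma^2 + \frac{(n-1)\sum_{k=1}^{s} p_k q_k^2 + (m-1)\sum_{k=1}^{s} p_k^2 q_k}{mn} \tag{3-2}$$

$$\sim \frac{1}{m}\sum_{k=1}^{s} p_k q_k^2 + \frac{1}{n}\sum_{k=1}^{s} p_k^2 q_k - \left(\frac{1}{m}+\frac{1}{n}\right)\sigma^2 \quad \text{as } m\to\infty \text{ and } n\to\infty.$$

If $(m/n)\to\lambda$ as $m\to\infty$ and $n\to\infty$,

$$\sqrt{m}\,(S(X,Y)-\sigma) \to N(0,V) \tag{3-3}$$

for

$$V = \sum_{k=1}^{s} p_k q_k^2 + \lambda\sum_{k=1}^{s} p_k^2 q_k - (1+\lambda)\sigma^2. \tag{3-4}$$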


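Proof. The smear of the CC matrix can be written as:

$$S(X,Y) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\phi(X_i,Y_j) \tag{3-5}$$

for

$$\phi(X_i,Y_j) = \begin{cases} 1 & \text{if } X_i = Y_j \\ 0 & \text{otherwise.} \end{cases}$$

Therefore,

$$E\,S(X,Y) = E\,\phi(X_i,Y_j) = \Pr(X_i = Y_j) = \sum_{i=1}^{s} p_i q_i, \tag{3-6}$$

the RHS of the above equation being the theoretical smear defined in equation (2-3). To compute the variance of S(X,Y) we evaluate variances and covariances among the $\phi$ variables.

$$\mathrm{Var}\,\phi(X_i,Y_j) = \sigma(1-\sigma).$$

If $i\neq u$ and $j\neq v$,

$$\mathrm{Cov}(\phi(X_i,Y_j),\phi(X_u,Y_v)) = 0.$$

If $j\neq v$,

$$\mathrm{Cov}(\phi(X_i,Y_j),\phi(X_i,Y_v)) = \sum_{k=1}^{s} p_k q_k^2 - \sigma^2.$$

If $i\neq u$,

$$\mathrm{Cov}(\phi(X_i,Y_j),\phi(X_u,Y_j)) = \sum_{k=1}^{s} p_k^2 q_k - \sigma^2.$$

Hence

$$(mn)^2\,\mathrm{Var}\,S(X,Y) = \sum_{i=1}^{m}\sum_{j=1}^{n}\mathrm{Var}\,\phi(X_i,Y_j) + \sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{\substack{v=1\\ v\neq j}}^{n}\mathrm{Cov}(\phi(X_i,Y_j),\phi(X_i,Y_v)) + \sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{\substack{u=1\\ u\neq i}}^{m}\mathrm{Cov}(\phi(X_i,Y_j),\phi(X_u,Y_j))$$

$$= mn\,\sigma(1-\sigma) + mn(n-1)\Big(\sum_{k=1}^{s} p_k q_k^2 - \sigma^2\Big) + mn(m-1)\Big(\sum_{k=1}^{s} p_k^2 q_k - \sigma^2\Big).$$

Equation (3-2) is obtained by dividing both sides of this equation by $(mn)^2$.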

To derive the asymptotic distribution of the smear of the CC matrix, note that equation (3-5) presents S(X,Y) as a U-statistic. Therefore if $(m/n)\to\lambda$ as $m\to\infty$ and $n\to\infty$ the asymptotic distribution of S is normal (see, e.g., theorem 9, p. 364, in Lehmann [9]). In particular

$$\sqrt{m}\,(S(X,Y)-\sigma) \to N(0,V)$$

for

$$V = \sigma_{10}^2 + \lambda\,\sigma_{01}^2 \tag{3-7}$$

where

$$\sigma_{10}^2 = \mathrm{Var}\,\psi_{10}(X_t), \qquad \sigma_{01}^2 = \mathrm{Var}\,\psi_{01}(Y_t)$$

and

$$\psi_{10}(x) = E\,\phi(x,Y_t) - \sigma = \Pr(Y_t = x) - \sigma,$$

$$\psi_{01}(y) = E\,\phi(X_t,y) - \sigma = \Pr(X_t = y) - \sigma.$$

Hence

$$\sigma_{10}^2 = \sum_{k=1}^{s} p_k (q_k - \sigma)^2, \qquad \sigma_{01}^2 = \sum_{k=1}^{s} q_k (p_k - \sigma)^2,$$

and (3-4) is obtained by substituting the above expressions for $\sigma_{10}^2$ and $\sigma_{01}^2$ into equation (3-7).

Remark that if $m\to\infty$ and $n\to\infty$ so that $(m/n)\to\lambda$, $m\,\mathrm{Var}\,S(X,Y) \to V$, i.e. the limit of the variance is the variance of the asymptotic distribution of S(X,Y).

The asymptotic result of proposition 3-1a may also be obtained by the $\delta$ method. The $\delta$ method is used to prove the asymptotic result in proposition 3-1b, which may also be proved by the 1-sample U-statistics theorem.


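For notational convenience let

$$\tau = \sum_{k=1}^{s} p_k^2. \tag{3-12}$$

To compute the variance of S(X) we evaluate covariances among the $\phi$'s.

$$\mathrm{Cov}(\phi(X_i,X_j),\phi(X_j,X_i)) = \mathrm{Var}\,\phi(X_i,X_j).$$

If $i\neq j$, $\mathrm{Var}\,\phi(X_i,X_j) = \tau(1-\tau)$.

If $i\neq u$, $i\neq v$, $j\neq u$ and $j\neq v$, $\mathrm{Cov}(\phi(X_i,X_j),\phi(X_u,X_v)) = 0$.

If $i\neq j$, $j\neq k$, $i\neq k$,

$$\mathrm{Cov}(\phi(X_i,X_j),\phi(X_i,X_k)) = E\,\phi(X_i,X_j)\phi(X_i,X_k) - \tau^2 = \Pr(X_i = X_j = X_k) - \tau^2 = \sum_{k=1}^{s} p_k^3 - \tau^2.$$

Hence,

$$m^4\,\mathrm{Var}\,S(X) = \sum_{i\neq j}\mathrm{Var}\,\phi(X_i,X_j) + \sum_{i\neq j}\mathrm{Cov}(\phi(X_i,X_j),\phi(X_j,X_i))$$

$$+ \sum_{\substack{i,j,k\\ i\neq j,\ i\neq k,\ j\neq k}} \Big[\mathrm{Cov}(\phi(X_i,X_j),\phi(X_i,X_k)) + \mathrm{Cov}(\phi(X_i,X_j),\phi(X_k,X_i)) + \mathrm{Cov}(\phi(X_i,X_j),\phi(X_k,X_j)) + \mathrm{Cov}(\phi(X_i,X_j),\phi(X_j,X_k))\Big]$$

$$= 2m(m-1)\,\tau(1-\tau) + 4m(m-1)(m-2)\Big(\sum_{k=1}^{s} p_k^3 - \tau^2\Big),$$

and equation (3-9) is obtained by dividing both sides of the above equation by $m^4$.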

To derive the asymptotic distribution of the smear of the CC matrix by the $\delta$ method, let $M_k$ be the count of alphabet character $a_k$ in $X=(X_1,\ldots,X_m)$, and let $\hat p_k = (M_k/m)$ be the frequency of character $a_k$ in X and $\hat p = (\hat p_1,\ldots,\hat p_s)$. Then $M=(M_1,\ldots,M_s)$ is multinomially distributed with parameters m and $p=(p_1,\ldots,p_s)$. By the normal approximation to the multinomial distribution, $\sqrt{m}\,(\hat p - p) \to N(0, D_p - pp^T)$, where $D_p$ is the diagonal matrix with the entries of vector p along the diagonal, and consequently $\|\hat p - p\| = O_p(m^{-1/2})$.

Let $g(p) = \sum_{k=1}^{s} p_k^2$. From equation (2-4) it follows that $S(X) = g(\hat p)$. Then

$$g(\hat p) = g(p) + (\mathrm{grad}\,g)^T(\hat p - p) + \varepsilon\,\|\hat p - p\|$$

with $\varepsilon\to 0$ as $\hat p\to p$.

Substituting $\mathrm{grad}\,g = 2p$ and multiplying the above expansion by $\sqrt{m}$ we obtain

$$\sqrt{m}\,(g(\hat p) - g(p)) = \sqrt{m}\;2p^T(\hat p - p) + o_p(1).$$

Therefore, $\sqrt{m}\,(g(\hat p)-g(p))$ is asymptotically normally distributed with mean 0 and variance $4p^T(D_p - pp^T)p = 4\big(p^T D_p\,p - (p^Tp)^2\big)$, which is written in the entries of p in equation (3-10).
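For the one-word case, both the exact variance obtained from the covariance count above and the limiting variance from the $\delta$ method can be checked numerically. The sketch below assumes, as the expansion of $g(\hat p)$ suggests, that S(X) is the sum of squared letter frequencies; the function names and the probabilities are illustrative choices, not taken from the report.

```python
from itertools import product
from math import prod

def self_smear(x, s):
    """Sum of squared letter frequencies of word x over an alphabet of s letters."""
    counts = [0] * s
    for c in x:
        counts[c] += 1
    return sum(c * c for c in counts) / len(x) ** 2

def variance_formula(p, m):
    """Exact variance built from the covariance count in the text."""
    tau = sum(pk ** 2 for pk in p)
    p3 = sum(pk ** 3 for pk in p)
    return (2 * m * (m - 1) * tau * (1 - tau)
            + 4 * m * (m - 1) * (m - 2) * (p3 - tau ** 2)) / m ** 4

def variance_exact(p, m):
    """Variance of the smear by enumerating every word of length m."""
    s = len(p)
    mean = second = 0.0
    for x in product(range(s), repeat=m):
        w = prod(p[c] for c in x)
        v = self_smear(x, s)
        mean += w * v
        second += w * v * v
    return second - mean ** 2

def asymptotic_variance(p):
    """Limiting variance of sqrt(m) (S(X) - tau): 4 (p' D_p p - (p'p)^2)."""
    tau = sum(pk ** 2 for pk in p)
    return 4 * (sum(pk ** 3 for pk in p) - tau ** 2)

p = [0.5, 0.3, 0.2]  # arbitrary letter probabilities for the illustration
print(variance_exact(p, 4), variance_formula(p, 4))
print(10 ** 6 * variance_formula(p, 10 ** 6), asymptotic_variance(p))
```

The second printed pair shows m times the exact variance approaching the $\delta$-method variance as m grows.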


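Proposition 3-2a. Let S(X,Y) be the smear of the NNC matrix for words X and Y. Under the independence model,

$$E\,S(X,Y) = \sigma^2, \tag{3-13}$$

and as $m\to\infty$ and $n\to\infty$ subject to $m = o(n^2)$ and $n = o(m^2)$,

$$\mathrm{Var}\,S(X,Y) \sim \frac{1}{m}\Big\{2\sigma^2\Big(\sum_k p_k q_k^2\Big) + \Big(\sum_k p_k q_k^2\Big)^2\Big\} + \frac{1}{n}\Big\{2\sigma^2\Big(\sum_k p_k^2 q_k\Big) + \Big(\sum_k p_k^2 q_k\Big)^2\Big\} - 3\Big(\frac{1}{m}+\frac{1}{n}\Big)\sigma^4. \tag{3-14}$$

If $(m/n)\to\lambda$ as $m\to\infty$ and $n\to\infty$, the smear is asymptotically normally distributed

$$\sqrt{m}\,(S(X,Y) - \sigma^2) \to N(0,V)$$

for

$$V = \lim m\,\mathrm{Var}\,S(X,Y) \quad \text{as } m\to\infty.$$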


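Proof. The smear of the NNC matrix for X and Y can be written as

$$S(X,Y) = \frac{1}{(m-1)(n-1)}\sum_{i=1}^{m-1}\sum_{j=1}^{n-1}\psi(X_i,X_{i+1};Y_j,Y_{j+1}) \tag{3-15}$$

where

$$\psi(X_i,X_{i+1};Y_j,Y_{j+1}) = \begin{cases} 1 & \text{if } X_i = Y_j \text{ and } X_{i+1} = Y_{j+1} \\ 0 & \text{otherwise.} \end{cases} \tag{3-16}$$

Therefore,

$$E\,S(X,Y) = E\,\psi(X_i,X_{i+1};Y_j,Y_{j+1}) = \Pr(X_i = Y_j,\ X_{i+1} = Y_{j+1}) = (\Pr(X_i = Y_j))^2 = \sigma^2,$$

and

$$\mathrm{Var}\,S(X,Y) = \frac{1}{(m-1)^2(n-1)^2}\sum_{i=1}^{m-1}\sum_{j=1}^{n-1}\sum_{u=1}^{m-1}\sum_{v=1}^{n-1}\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big). \tag{3-17}$$

To evaluate (3-17) let

$$V_{ij} = \sum_{u=1}^{m-1}\sum_{v=1}^{n-1}\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big) \tag{3-18}$$

and rewrite equation (3-17) as

$$\mathrm{Var}\,S(X,Y) = \frac{1}{(m-1)^2(n-1)^2}\sum_{i=1}^{m-1}\sum_{j=1}^{n-1} V_{ij}. \tag{3-19}$$

We evaluate $V_{ij}$ for $i=2,\ldots,m-1$ and $j=2,\ldots,n-1$.

By independence, if $|i-u|>1$ and $|j-v|>1$,

$$\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big) = 0.$$

Therefore the values of u and v which contribute to $V_{ij}$ are such that either $|i-u|\le 1$ or $|j-v|\le 1$ as presented in figure 3-1. We now compute their contributions.

If $u=i$ and $|v-j|>1$,

$$\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big) = E\,\psi(X_i,X_{i+1};Y_j,Y_{j+1})\,\psi(X_i,X_{i+1};Y_v,Y_{v+1}) - \sigma^4 = \Big(\sum_{k=1}^{s} p_k q_k^2\Big)^2 - \sigma^4; \tag{3-20}$$

if $u=i\pm 1$ and $|v-j|>1$,

$$\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big) = \sigma^2\Big(\sum_{k=1}^{s} p_k q_k^2 - \sigma^2\Big); \tag{3-21}$$

if $v=j$ and $|i-u|>1$,

$$\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big) = \Big(\sum_{k=1}^{s} p_k^2 q_k\Big)^2 - \sigma^4, \tag{3-22}$$

and if $v=j\pm 1$ and $|i-u|>1$,

$$\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big) = \sigma^2\Big(\sum_{k=1}^{s} p_k^2 q_k - \sigma^2\Big). \tag{3-23}$$

As indicated in figure 3-1, for fixed $i=2,\ldots,m-1$ and $j=2,\ldots,n-1$, $V_{ij}$ is a sum of $3(m-4)+3(n-4)+9$ nonzero covariances. Of those all but nine can be computed from formulae (3-20) to (3-23), the nine terms being $\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big)$ for u and v such that $|u-i|\le 1$ and $|j-v|\le 1$. These covariances can be derived similarly. However, we do not need to compute them explicitly as their total contribution to $V_{ij}$ is O(1) compared with that of the other contributing terms, which may be computed from formulae (3-20) to (3-23) and is of order O(m)+O(n) as can be seen from (3-24) below. Thus,

$$V_{ij} = \sum_{|v-j|>1}\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_i,X_{i+1};Y_v,Y_{v+1})\big) + \sum_{u=i\pm 1}\ \sum_{|v-j|>1}\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big)$$

$$+ \sum_{|u-i|>1}\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_j,Y_{j+1})\big) + \sum_{|u-i|>1}\ \sum_{v=j\pm 1}\mathrm{Cov}\big(\psi(X_i,X_{i+1};Y_j,Y_{j+1}),\,\psi(X_u,X_{u+1};Y_v,Y_{v+1})\big) + O(1).$$

Each of the four sums in the above equation can be computed by substituting formulae (3-20) to (3-23) for the covariances to obtain

$$V_{ij} = 2(n-4)\,\sigma^2\Big(\sum_k p_k q_k^2 - \sigma^2\Big) + 2(m-4)\,\sigma^2\Big(\sum_k p_k^2 q_k - \sigma^2\Big) + (n-4)\Big(\Big(\sum_k p_k q_k^2\Big)^2 - \sigma^4\Big) + (m-4)\Big(\Big(\sum_k p_k^2 q_k\Big)^2 - \sigma^4\Big) + O(1). \tag{3-24}$$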

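The value of $V_{ij}$ when $i=1$ or $j=1$ is slightly different from the expression in (3-24). However, $\sum_{i=2}^{m-1}\sum_{j=2}^{n-1} V_{ij}$ is of order $O(m^2 n)+O(mn^2)$, whereas $\sum_{j=1}^{n-1} V_{1j}$ and $\sum_{i=1}^{m-1} V_{i1}$ are of lower order. Hence, as $m\to\infty$ and $n\to\infty$ subject to $m=o(n^2)$ and $n=o(m^2)$,

$$\mathrm{Var}\,S(X,Y) \sim \frac{1}{(m-1)^2(n-1)^2}\sum_{i=2}^{m-1}\sum_{j=2}^{n-1} V_{ij}$$

$$\sim \frac{1}{m}\Big(2\sigma^2\Big(\sum_k p_k q_k^2 - \sigma^2\Big) + \Big(\sum_k p_k q_k^2\Big)^2 - \sigma^4\Big) + \frac{1}{n}\Big(2\sigma^2\Big(\sum_k p_k^2 q_k - \sigma^2\Big) + \Big(\sum_k p_k^2 q_k\Big)^2 - \sigma^4\Big) \tag{3-25}$$

which equals the RHS of equation (3-14).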

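To derive the asymptotic distribution of S(X,Y) let $M_{ij}$ and $N_{ij}$ be the counts of the 2-letter syllable $a_i a_j$ in X and Y respectively. Formally,

$$M_{ij} = \sum_{h=1}^{m-1} I_{ij}(X_h,X_{h+1}), \qquad N_{ij} = \sum_{h=1}^{n-1} I_{ij}(Y_h,Y_{h+1}) \tag{3-26}$$

for $I_{ij}(\cdot,\cdot)$ the indicator function

$$I_{ij}(x,y) = \begin{cases} 1 & \text{if } x = a_i \text{ and } y = a_j \\ 0 & \text{otherwise.} \end{cases} \tag{3-27}$$

Note that

$$\sum_{i=1}^{s}\sum_{j=1}^{s} M_{ij} = m-1, \qquad \sum_{i=1}^{s}\sum_{j=1}^{s} N_{ij} = n-1$$

and let

$$\hat p_{ij} = \frac{M_{ij}}{m-1} \quad\text{and}\quad \hat q_{ij} = \frac{N_{ij}}{n-1} \tag{3-28}$$

be the frequencies of syllable $a_i a_j$ in X and Y. The following notation is useful:

$$M^T = (M_{11},\ldots,M_{1s},\ldots,M_{s1},\ldots,M_{ss}), \qquad N^T = (N_{11},\ldots,N_{1s},\ldots,N_{s1},\ldots,N_{ss}) \tag{3-29}$$

$$p^T = (p_{11},\ldots,p_{1s},\ldots,p_{s1},\ldots,p_{ss}), \qquad q^T = (q_{11},\ldots,q_{1s},\ldots,q_{s1},\ldots,q_{ss}) \tag{3-30}$$

$$\hat p^T = (\hat p_{11},\ldots,\hat p_{1s},\ldots,\hat p_{s1},\ldots,\hat p_{ss}), \qquad \hat q^T = (\hat q_{11},\ldots,\hat q_{1s},\ldots,\hat q_{s1},\ldots,\hat q_{ss}). \tag{3-31}$$

In this notation, the smear of the NNC character matrix can be written as

$$S(X,Y) = \frac{\sum_{i=1}^{s}\sum_{j=1}^{s} M_{ij}N_{ij}}{(m-1)(n-1)} = \hat p^T\hat q. \tag{3-32}$$

Let $g(p,q) = p^T q$. By the differentiability of $g(\cdot,\cdot)$,

$$g(\hat p,\hat q) = g(p,q) + (\mathrm{grad}\,g)^T\begin{pmatrix}\hat p - p\\ \hat q - q\end{pmatrix} + \varepsilon_{mn}\left\|\begin{pmatrix}\hat p - p\\ \hat q - q\end{pmatrix}\right\|$$

where $\varepsilon_{mn}\to 0$ as $\hat p\to p$ and $\hat q\to q$. Substituting

$$(\mathrm{grad}\,g)^T = \Big(\frac{\partial}{\partial p_{11}},\ldots,\frac{\partial}{\partial p_{ss}},\frac{\partial}{\partial q_{11}},\ldots,\frac{\partial}{\partial q_{ss}}\Big)\,g(p,q) = (q^T,\ p^T)$$

into the above equation, we obtain

$$S(X,Y) = g(p,q) + q^T(\hat p - p) + p^T(\hat q - q) + \varepsilon_{mn}\left\|\begin{pmatrix}\hat p - p\\ \hat q - q\end{pmatrix}\right\|. \tag{3-33}$$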


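Note that the distribution of M and N is not multinomial. The asymptotic distribution of $\hat p$ (and $\hat q$) is provided by the following lemma.

Lemma 3.1. Under the independence model, $\hat p$ defined by (3-31) and (3-28) is asymptotically normally distributed

$$\sqrt{m}\,(\hat p - E\hat p) \to N(0,\Sigma^1)$$

for

$$(E\hat p)_{ij} = p_i p_j \tag{3-34}$$

$$\Sigma^1_{ij,uv} = p_i p_j p_u\,\delta_{iv} + p_i p_j\,\delta_{iu}\delta_{jv} + p_i p_j p_v\,\delta_{ju} - 3\,p_i p_j p_u p_v. \tag{3-35}$$

($\delta_{ij}$ is the usual Kronecker delta.)

Proof. Let $c^T = (c_{11},\ldots,c_{ss})$ be an arbitrary vector of $s^2$ constants. By (3-29) and (3-26),

$$c^T M = \sum_{i=1}^{s}\sum_{j=1}^{s} c_{ij}M_{ij} = \sum_{h=1}^{m-1}\sum_{i=1}^{s}\sum_{j=1}^{s} c_{ij}\,I_{ij}(X_h,X_{h+1}).$$

Note that $c^T M$ is a sum of $(m-1)$ 1-dependent variables.

$$\mathrm{Var}(c^T M) = \sum_i\sum_j\sum_u\sum_v c_{ij}c_{uv}\,\mathrm{Cov}(M_{ij},M_{uv})$$

$$\mathrm{Cov}(M_{ij},M_{uv}) = \mathrm{Cov}\Big(\sum_{\alpha=1}^{m-1} I_{ij}(X_\alpha,X_{\alpha+1}),\ \sum_{\beta=1}^{m-1} I_{uv}(X_\beta,X_{\beta+1})\Big).$$

If $|\alpha-\beta|>1$, by independence, $\mathrm{Cov}(I_{ij}(X_\alpha,X_{\alpha+1}),I_{uv}(X_\beta,X_{\beta+1})) = 0$. Hence

$$\mathrm{Cov}(M_{ij},M_{uv}) = \sum_{\alpha=2}^{m-1}\mathrm{Cov}(I_{ij}(X_\alpha,X_{\alpha+1}),I_{uv}(X_{\alpha-1},X_\alpha)) + \sum_{\alpha=1}^{m-1}\mathrm{Cov}(I_{ij}(X_\alpha,X_{\alpha+1}),I_{uv}(X_\alpha,X_{\alpha+1})) + \sum_{\alpha=1}^{m-2}\mathrm{Cov}(I_{ij}(X_\alpha,X_{\alpha+1}),I_{uv}(X_{\alpha+1},X_{\alpha+2})). \tag{3-36}$$

The covariances in the three sums above are:

$$\mathrm{Cov}(I_{ij}(X_\alpha,X_{\alpha+1}),I_{uv}(X_{\alpha-1},X_\alpha)) = \Pr(X_{\alpha-1}=a_u,\ X_\alpha=a_v,\ X_\alpha=a_i,\ X_{\alpha+1}=a_j) - p_i p_j p_u p_v = p_i p_j p_u\,\delta_{iv} - p_i p_j p_u p_v$$

and similarly,

$$\mathrm{Cov}(I_{ij}(X_\alpha,X_{\alpha+1}),I_{uv}(X_\alpha,X_{\alpha+1})) = p_i p_j\,\delta_{iu}\delta_{jv} - p_i p_j p_u p_v$$

and

$$\mathrm{Cov}(I_{ij}(X_\alpha,X_{\alpha+1}),I_{uv}(X_{\alpha+1},X_{\alpha+2})) = p_i p_j p_v\,\delta_{ju} - p_i p_j p_u p_v.$$

Substituting the above formulae for the covariances into (3-36) we obtain

$$\mathrm{Cov}(M_{ij},M_{uv}) = (m-2)\big(p_i p_j p_u\,\delta_{iv} + p_i p_j p_v\,\delta_{ju} - 2\,p_i p_j p_u p_v\big) + (m-1)\big(p_i p_j\,\delta_{iu}\delta_{jv} - p_i p_j p_u p_v\big). \tag{3-37}$$

Therefore $\mathrm{Cov}(M_{ij},M_{uv}) = O(m)$ and so is $\mathrm{Var}(c^T M)$. According to the m-dependent CLT (theorem 7.3.1 of Chung [3]),

$$\frac{c^T M - c^T E M}{\sqrt{\mathrm{Var}\,c^T M}} \to N(0,1)$$

and consequently

$$\sqrt{m}\,(\hat p - E\hat p) \to N(0,\Sigma^1),$$

the parameters of the distribution given by (3-34) and (3-35).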


Lemma 3.1 makes the $\delta$ method applicable to the expansion (3-33) and the asymptotic distribution of the smear of the NNC matrix easily derivable. If $(m/n)\to\lambda$ as $m\to\infty$ and $n\to\infty$, for $p = E\hat p$ and $q = E\hat q$ equation (3-33) becomes:

$$\sqrt{m}\,(S(X,Y)-\sigma^2) = \sqrt{m}\;E\hat q^T(\hat p - E\hat p) + \sqrt{\lambda}\,\sqrt{n}\;E\hat p^T(\hat q - E\hat q) + o_p(1).$$

Therefore the LHS is asymptotically normally distributed with mean 0 and variance

$$E\hat q^T\,\Sigma^1\,E\hat q + \lambda\,E\hat p^T\,\Sigma^2\,E\hat p.$$

$E\hat p$ and $\Sigma^1$ are given in (3-34) and (3-35) of lemma 3.1 and the formulae for $E\hat q$ and $\Sigma^2$ are obtained by interchanging $q_k$ for $p_k$. Substituting $E\hat p$, $E\hat q$, $\Sigma^1$ and $\Sigma^2$ into the above asymptotic variance we obtain

$$2\sigma^2\sum_k p_k q_k^2 + \Big(\sum_k p_k q_k^2\Big)^2 + \lambda\Big\{2\sigma^2\sum_k p_k^2 q_k + \Big(\sum_k p_k^2 q_k\Big)^2\Big\} - 3(1+\lambda)\sigma^4,$$

as asserted in proposition 3-2a.
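The final substitution can be verified numerically. The sketch below is illustrative: it builds the covariance entries of lemma 3.1 for an arbitrary small alphabet, evaluates the two quadratic forms, and compares the result with the closed-form variance. The function names, probabilities and the value of the limit ratio `lam` are choices made for the example.

```python
from itertools import product

def sigma_matrix(p):
    """Entries of the lemma's covariance matrix, keyed by ((i, j), (u, v))."""
    d = lambda a, b: 1.0 if a == b else 0.0
    s = len(p)
    return {((i, j), (u, v)):
            p[i] * p[j] * p[u] * d(i, v) + p[i] * p[j] * d(i, u) * d(j, v)
            + p[i] * p[j] * p[v] * d(j, u) - 3 * p[i] * p[j] * p[u] * p[v]
            for i, j, u, v in product(range(s), repeat=4)}

def quad_form(w, cov):
    """w' cov w for a vector w keyed by syllable index (i, j)."""
    return sum(w[ij] * c * w[uv] for (ij, uv), c in cov.items())

def syllable_means(p):
    """Expected syllable frequencies p_i p_j under independence."""
    return {(i, j): p[i] * p[j] for i in range(len(p)) for j in range(len(p))}

def nnc_variance_closed_form(p, q, lam):
    """Closed-form asymptotic variance of sqrt(m) (S(X,Y) - sigma^2)."""
    sigma = sum(a * b for a, b in zip(p, q))
    pq2 = sum(a * b ** 2 for a, b in zip(p, q))
    p2q = sum(a ** 2 * b for a, b in zip(p, q))
    return (2 * sigma ** 2 * pq2 + pq2 ** 2
            + lam * (2 * sigma ** 2 * p2q + p2q ** 2)
            - 3 * (1 + lam) * sigma ** 4)

p, q, lam = [0.5, 0.3, 0.2], [0.2, 0.2, 0.6], 1.5  # arbitrary illustration values
via_quadratic_forms = (quad_form(syllable_means(q), sigma_matrix(p))
                       + lam * quad_form(syllable_means(p), sigma_matrix(q)))
print(via_quadratic_forms, nnc_variance_closed_form(p, q, lam))
```

For p = q and lam = 1 the same functions give a two-word variance equal to half of the one-word quadratic form 4 p'Σ¹p, which is a convenient cross-check against the one-word result.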


Remark, again, that the limit of the variance of $\sqrt{m}\,(S(X,Y)-\sigma^2)$ equals the variance of the asymptotic distribution.


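Proposition 3-2b. Let S(X) be the smear of the NNC matrix for X written under independence. Then,

$$E\,S(X) = \frac{1}{m-1} + \frac{2(m-2)}{(m-1)^2}\sum_{k=1}^{s} p_k^3 + \frac{(m-2)(m-3)}{(m-1)^2}\,\tau^2 \tag{3-38}$$

and

$$\sqrt{m}\,(S(X)-\tau^2) \to N\Big(0,\ 4\Big(2\tau^2\sum_{k=1}^{s} p_k^3 + \Big(\sum_{k=1}^{s} p_k^3\Big)^2 - 3\tau^4\Big)\Big). \tag{3-39}$$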


(3-39) 


Proof. The smear of the NNC matrix for X can be written as

$$S(X) = \frac{\sum_{i=1}^{m-1}\sum_{j=1}^{m-1}\psi(X_i,X_{i+1};X_j,X_{j+1})}{(m-1)^2} = \frac{1}{m-1} + \frac{\sum_{i\neq j}\psi(X_i,X_{i+1};X_j,X_{j+1})}{(m-1)^2} \tag{3-40}$$

for $\psi(\cdot,\cdot\,;\cdot,\cdot)$ defined in (3-16).

To evaluate ES(X) remark that if $|i-j|>1$,

$$E\,\psi(X_i,X_{i+1};X_j,X_{j+1}) = P(X_i = X_j)\,P(X_{i+1} = X_{j+1}) = \tau^2, \tag{3-41}$$

and if $|i-j|=1$,

$$E\,\psi(X_i,X_{i+1};X_j,X_{j+1}) = P(X_i = X_{i+1} = X_{i+2}) = \sum_{k=1}^{s} p_k^3. \tag{3-42}$$

$E\psi$ is given by (3-42) for $2(m-2)$ of the $(m-1)^2-(m-1) = (m-1)(m-2)$ pairs in the summation of (3-40) and by (3-41) for the remaining pairs.

Equation (3-38) is obtained by taking expectations on both sides of (3-40) and substituting from (3-41) and (3-42) into (3-40).
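The pair counting behind this expectation can be confirmed by brute force on very short words. The sketch below is an illustrative check with arbitrary letter probabilities; it assumes the one-word NNC smear averages the indicator over all (m-1)² cells, diagonal included, as in the summation above.

```python
from itertools import product
from math import prod

def nnc_self_smear(x):
    """Fraction of cells (i, j) with x[i] == x[j] and x[i+1] == x[j+1]."""
    m = len(x)
    hits = sum(x[i] == x[j] and x[i + 1] == x[j + 1]
               for i in range(m - 1) for j in range(m - 1))
    return hits / (m - 1) ** 2

def expected_smear_formula(p, m):
    """Expectation assembled from the diagonal, |i-j| = 1 and |i-j| > 1 cells."""
    tau = sum(pk ** 2 for pk in p)
    p3 = sum(pk ** 3 for pk in p)
    return (1 / (m - 1)
            + 2 * (m - 2) * p3 / (m - 1) ** 2
            + (m - 2) * (m - 3) * tau ** 2 / (m - 1) ** 2)

def expected_smear_exact(p, m):
    """Expectation by enumerating every word of length m."""
    return sum(prod(p[c] for c in x) * nnc_self_smear(x)
               for x in product(range(len(p)), repeat=m))

p = [0.6, 0.4]  # arbitrary two-letter alphabet for the illustration
for m in (3, 4, 5):
    print(m, expected_smear_exact(p, m), expected_smear_formula(p, m))
```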

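To derive the asymptotic distribution of S(X) note that

$$S(X) = \frac{\sum_{i=1}^{s}\sum_{j=1}^{s} M_{ij}^2}{(m-1)^2} = \hat p^T\hat p$$

for $M_{ij}$ defined by (3-26) and $\hat p$ defined by (3-31) and (3-28). Let $g(p) = p^T p$. Since

$$(\mathrm{grad}\,g)^T = \Big(\frac{\partial}{\partial p_{11}},\ldots,\frac{\partial}{\partial p_{ss}}\Big)\,g(p) = 2p^T,$$

$$S(X) = g(p) + 2p^T(\hat p - p) + \varepsilon_m\,\|\hat p - p\|.$$

For $p = E\hat p$,

$$g(p) = \sum_{i=1}^{s}\sum_{j=1}^{s}(p_i p_j)^2 = \tau^2$$

and

$$\sqrt{m}\,(S(X)-\tau^2) = 2\sqrt{m}\;p^T(\hat p - p) + o_p(1).$$

By lemma 3.1 the LHS of the above equation is asymptotically normal with mean 0 and variance which, computed from (3-35), is

$$4\sum_i\sum_j\sum_u\sum_v p_i p_j p_u p_v\,\big(p_i p_j p_u\,\delta_{iv} + p_i p_j\,\delta_{iu}\delta_{jv} + p_i p_j p_v\,\delta_{ju} - 3\,p_i p_j p_u p_v\big)$$

$$= 4\Big(\sum_i\sum_j\sum_u p_i^3 p_j^2 p_u^2 + \sum_i\sum_j p_i^3 p_j^3 + \sum_i\sum_j\sum_v p_i^2 p_j^3 p_v^2 - 3\tau^4\Big)$$

$$= 4\Big(2\tau^2\sum_k p_k^3 + \Big(\sum_k p_k^3\Big)^2 - 3\tau^4\Big)$$

as asserted in (3-39).

If X and Y are two independent identically distributed words, i.e., if $m=n$ and $p_i = q_i$ for all i, both S(X) and S(X,Y) estimate the same parameter $\tau$ of equation (3-12). Propositions 3-1 and 3-2 assert that the variance of the asymptotic distribution of $\sqrt{m}\,(S(X)-\tau)$ is twice that of $\sqrt{m}\,(S(X,Y)-\tau)$ for both the crude and the NNC character matrices.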


4.  SMEARS  ALONG  DIAGONALS  OF  CHARACTER  MATRICES  FOR  STRING  DATA 
Chapter  2  introduced  a  variety  of  character  matrices  that  are  very 
useful  in  bringing  out  visually  similarities  among  different  words  or 
within  a  word.  The  visual  examination  of  character  matrices  -  as 
insightful  as  it  can  be  -  is  only  a  first  step  in  the  analysis  of  string 
data  as  it  is  limited  in  two  aspects. 

(i)  It  is  stressful  to  the  investigator's  eye. 

(ii)  While  bringing  out  strings  that  may  be  shared  between  the 
words  under  comparison,  it  falls  short  of  assessing  similarities 
quantitatively.  As  a  consequence,  visual  recognition  of  common 
strings  is  partially  subjective. 

This  chapter  addresses  the  question  of  how  to  make  the  detection  of 
diagonals  objective,  i.e.  describable  quantitatively,  and  possible  to 
implement  on  a  machine.  We  are  looking  for  statistics  which  reflect 
the  presence  of  diagonals  in  character  matrices. 

The  statistic  that  has  attracted  the  attention  of  researchers  so 


far  is  the  length  of  the  longest  common  subsequence  (LLCS)  of  the  two 
words  under  examination.  For  . . . ,Ia)  and  Y*(Y^, . . . ,Ta) ,  the  LLCS 

can  be  defined  as: 

max(k:  X4  *Y.  , ...,X4  =Y .  for  lii-,<. . .  <ivim  and  lij-,<. . .  <jvin) 

Jj,  lk  ^k  1  1  * 

Needleman and Wunsch [11] were the first to propose the LLCS as a measure
of similarity between genetic sequences. Their method to find the LLCS
was later modified to a more efficient dynamic programming algorithm by
Sankoff [13]. After the LLCS has been computed, the path $\{(i_l, j_l) : 1\le l\le k\}$
through the crude character matrix of X and Y can be traced for the
investigator to examine.
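As an illustration of the statistic only, the LLCS can be computed with the standard textbook dynamic programming recursion. The sketch below is not claimed to reproduce the formulation of Needleman and Wunsch [11] or of Sankoff [13]; the function name `llcs` is hypothetical, and all matches are weighted equally.

```python
def llcs(x, y):
    """Length of the longest common subsequence of sequences x and y.

    Textbook dynamic programming recursion (an assumption of this sketch,
    not the formulation of the cited papers). Only the previous row of the
    table is kept, so memory use is O(len(y)).
    """
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                # Extend the best subsequence ending before both letters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the current letter of x or the current letter of y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(llcs("ABCBDAB", "BDCABA"))
```

Recovering the path through the character matrix would require keeping the full table and tracing back from its last cell; that step is omitted here.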


The algorithm of Needleman and Wunsch suffers from three
drawbacks.  If  a  relatively  long  string  is  present  in  both  X  and  Y,  it 
will  most  probably  contribute  to  the  LLCS.  However,  if  the  common  string 
is  present  once  in  X  and  in  two  repeats  in  Y  as  shown  in  figure  4-1, 
then  of  the  two  different  common  subsequences  of  approximately  equal 
length,  the  algorithm  will  only  select  one  and  will  not  let  the  molecular 


Fig. 4-1. The presence of a relatively long string in both X and Y in
(a) will most probably be detected by the Needleman-Wunsch algorithm. A
repeat of the string within Y as in (b) will not be detected.

Furthermore,  the  algorithm  weighs  all  matches  and  mismatches  in  the  same 
way,  counter  to  the  sense  of  the  statistician  that  matches  in  rare 
letters  should  weigh  more  than  matches  among  rather  frequent  letters,  and 
the  knowledge  of  the  molecular  biologist  that  some  substitutions  on 
genetic  molecules  affect  the  function  of  the  molecules  and  the  state 
of their cells dramatically, whereas others do not. Needleman and Wunsch
were aware of this problem and mentioned that weights other than 0 and 1
could be used, but they provided no hints as to how to obtain these
weights from data, and their remark was later ignored in the mathematical
literature. Finally, little is known of the distributional properties of
the LLCS. Chvatal and Sankoff [4] showed that if $L_n$ is the LLCS between
two words both of length n, $EL_n$ is superadditive with respect to n, i.e.
$EL_{m+n} \ge EL_m + EL_n$, and therefore $E(L_n/n)$ converges to a constant that
depends on the size of the alphabet in which the words are written. They
also provided upper and lower bounds for the limit.

Deken [5] showed that $(L_n/n)$ converges almost surely to a random
variable under a stationarity condition on the LLCS, and that if the two
words are written independently of each other and the alphabet letters
are equiprobable - an assumption untenable for biological data - the
limit is a constant. Finally Steele [14] showed that if the vectors
$(X_i,Y_i)$ are I.I.D., $\mathrm{Var}\,L_n = O(n)$ and proposed the replacement of the
LLCS by other statistics in view of its intractability.

Suppose that $X=(X_1,\ldots,X_m)$ and $Y=(Y_1,\ldots,Y_n)$ with $X_i$ and $Y_j$
obtaining values in a finite alphabet $\{a_1,\ldots,a_s\}$, assume that $m \ge n$ and
let $\{M_{i,j}\}$, $i=1,\ldots,m$, $j=1,\ldots,n$, be the CC matrix of X and Y. In the visual
examination of a character matrix, in order to detect substrings common
to both words, the investigator tilts the character matrix, aligns his
axis of vision to the matrix entries $\{M_{i,i+k}\}$, $i=1,\ldots,\min(m,n-k)$, and
searches for consecutive non-blank matrix entries along the matrix
diagonals $\{M_{i,i+k}\}$.

In order to fix ideas we introduce some new nomenclature. If X and
Y share in common a relatively long substring so that

$$X_u = Y_{u+k},\ X_{u+1} = Y_{u+1+k},\ \ldots,\ X_v = Y_{v+k} \qquad (4-1)$$

for $1\le u\le v\le m$ and $1\le u+k\le v+k\le n$,

then the entries $M_{u,u+k}, M_{u+1,u+k+1}, \ldots, M_{v,v+k}$ will be nonblank as
shown in figures 4-2a and 4-2b and we shall say that words X and Y share
in common a string of length v-u+1 lying along the diagonal at lag k, or,
more briefly, that X and Y share a common string at lag k.

Let $\{M_{i,j}\}$ be the CC matrix for X and Y and suppose that $k \ge 0$. A long
common string in (4-1) would cause the ratio of non-blank matrix entries
to the total number of entries on the diagonal $(M_{1,1+k},\ldots,M_{n-k,n})$ to be
higher than ratios on parallel diagonals of comparable length as
indicated in figures 4-2b and 4-2c. We shall call the ratio of nonblank
matrix entries on $(M_{1,1+k},\ldots,M_{n-k,n})$ to the total number of matrix
entries along the diagonal (i.e. n-k), the diagonal smear at lag k.

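For $m \ge n$, the process of diagonal smears can be written as:

$$D(k) = \frac{1}{n-k}\sum_{i=1}^{n-k}\delta(X_i, Y_{i+k}) \quad \text{if } k \ge 0,$$

$$D(k) = \frac{1}{\min(n,m+k)}\sum_{i=1}^{\min(n,m+k)}\delta(X_{i-k}, Y_i) \quad \text{if } k < 0, \qquad (4-2)$$

for $\delta(.,.)$ defined in (3-6).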

The process of diagonal smears is relevant in detecting common
substrings among X and Y because a common substring would cause the
diagonal smear at a lag specified by the string's position in the two
words to be relatively high and, conversely, lags at which diagonal
smears are high could signify the presence of a common substring.
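The process of diagonal smears is straightforward to compute. The following is a minimal sketch, assuming that an entry of the CC matrix is nonblank exactly when the two letters it compares are identical (the function of equation (3-6) is not restated in this chapter) and using 0-based indices; the function name `diagonal_smears` is hypothetical. No assumption is made on which word is longer.

```python
def diagonal_smears(x, y):
    """Return a dict mapping each lag k to (B, N, B / N).

    The diagonal at lag k pairs x[i] with y[i + k] (0-based indices).
    N is the length of that diagonal and B the number of matching pairs,
    so B / N is the diagonal smear at lag k. A match is assumed to be
    plain letter identity.
    """
    m, n = len(x), len(y)
    smears = {}
    for k in range(-(m - 1), n):
        lo = max(0, -k)        # first i with i + k >= 0
        hi = min(m, n - k)     # one past the last i with i + k < n
        N = hi - lo
        B = sum(1 for i in range(lo, hi) if x[i] == y[i + k])
        smears[k] = (B, N, B / N)
    return smears


if __name__ == "__main__":
    # "ABC" starts at position 0 of x and position 2 of y, i.e. at lag 2.
    for lag, (B, N, d) in sorted(diagonal_smears("ABCDE", "XXABC").items()):
        print(lag, B, N, round(d, 2))
```

The counts B and N are returned alongside the ratio because the length of the diagonal matters as much as the smear itself when diagonals of different lengths are compared.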




"The proof of the pudding is in the eating." If the process D(.)
proposed is of any value, it should pick up diagonals where they exist
and indicate that there is nothing of interest where there are no
diagonals. The performance of D(.) will be assessed on chorion proteins
292 and 18B which were examined visually in the development of the
variety of character matrices in chapter 2. The cytochrome c protein of
Tetrahymena pyriformis and the chorion 292 protein were chosen as a
'control' pair because it was expected that they would share no
similarity whatsoever as they play very different roles in the lives of
two distant organisms.

Figures 4-3a and 4-3b present the CC and the BNC1 character matrices
for the control pair and illustrate that, as expected, the proteins of
the control pair share no long strings in common. The longest common
string is three letters long, while the longest string common to both
proteins up to non-consecutive mismatches is only four letters long.

Figures 4-4a and 4-4b plot diagonal smears vs. lag for chorion
proteins 292 and 18B and the control pair. For diagonals at highly
positive or highly negative lags diagonal smears are computed from a
small number of observations; this is the reason for which the
variability of $D_k$ is higher in the left and right tails of the plots
than in the middle.

As illustrated from their BNC1 matrix on figure 4-5, the three
most prominent strings common to the chorion proteins 292 and 18B lie
along the diagonals at lags -12, -10 and 0, other prominent common
strings lying, in order of diminishing prominence, on the diagonals at
lags -15, -5, -20, 5 and -100. Table 4-1 lists the twenty-four largest
diagonal smears in decreasing order.

Table 4-1. Sorted diagonal smears for proteins 292 and 18B.

RANK   LAG    D. SMEAR      RANK   LAG    D. SMEAR

  1    -12      .48          13     63      .21
  2      0      .36          14   -114      .20
  3   -100      .29          15   -129      .20
  4    -98      .25          16     10      .20
  5      5      .24          17    -83      .20
  6    -24      .24          18    -29      .19
  7    -10      .22          19     57      .19
  8     67      .22          20     73      .19
  9    -18      .22          21    -26      .19
 10     -2      .21          22     94      .19
 11     -5      .21          23    -20      .18
 12    -96      .21          24     72      .18

It can be seen from table 4-1 that the diagonal smears at lags -12, -10,
0, -15, -5, -20, 5 and -100 (where prominent common strings lie) are
the first, seventh, second, thirtieth, eleventh, twenty-third, fifth and
third largest. If there were a nonblank character along one of the
diagonals of length two, its diagonal smear (.5) would be higher than
any of the above. Clearly, it does not suffice to simply sort diagonal
smears in decreasing order. The threshold above which diagonal smears
should be considered as 'significantly' high must depend on diagonal
length.

Under the independence model, both the matrix smear and the
diagonal smear estimate the same parameter. The matrix smear of the CC
matrix for two words is computed from all blank and nonblank entries of
the matrix. The diagonal smear estimates $\sigma$ from the ratio of non-blank
characters on the diagonal. Under the independence assumptions the matrix
smear has been proven to be asymptotically normally distributed about
the theoretical smear and the number of non-blank characters on the
diagonal is binomially distributed. Hence, an upper confidence bound from
the matrix smear and a lower confidence bound for the same parameter from
the binomial data at each diagonal may be computed for $\sigma$. Throughout the
remainder of the chapter words will be assumed to be written
independently within and between themselves.

Let U be a $1-\alpha_1$ upper confidence bound for $\sigma$ computed from S
through proposition 3-1. Let $\hat V$ be the m.l.e. of the asymptotic variance
V given in equation (3-4). Then, clearly $\hat V$ converges in probability to V
and

$$U = S + z_{1-\alpha_1}\sqrt{\hat V / m} + o_p(1/\sqrt{m}), \qquad (4-3)$$

where $z_{1-\alpha_1}$ is the $(1-\alpha_1)$ quantile of the standard normal distribution.
For long words,

$$\Pr(U > \sigma) \approx 1-\alpha_1. \qquad (4-4)$$

Table 4-2 lists below the 90%, 95% and 99% asymptotic confidence
intervals for $\sigma$ computed from S by proposition 3-1a.

Table 4-2. Asymptotic confidence intervals for the theoretical smears of
CCM for protein pairs (292, 18B) and (292, Cytochrome c).

$1-\alpha_1$     292, 18B       292, Cytochr. c

  .90        (.11-.16)        (.06-.09)
  .95        (.11-.16)        (.06-.09)
  .99        (.10-.17)        (.05-.10)

The length of the confidence interval does not depend crucially on the
confidence level up to the second decimal digit because the estimate of
the asymptotic standard deviation of S in equation (3-2) is small.

Figures 4-6a and 4-6b plot $D_k$ and the asymptotic two sided 95%
confidence interval for $\sigma$ for each of the protein pairs.

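Let $L_k$ be an exact $1-\alpha_2$ lower confidence bound for $\sigma$ computed from
the binomial data along the diagonal of lag k. Suppose that the length of
the matrix diagonal at lag k is N and that there are B nonblank entries
on the diagonal. $L_k$ is defined as:

$$L_k = \begin{cases} 0 & \text{if } B = 0, \\ \text{the root } x \text{ of the equation } \sum_{j=B}^{N}\binom{N}{j}x^j(1-x)^{N-j} = \alpha_2 & \text{if } B > 0, \end{cases} \qquad (4-5)$$

so that

$$\Pr(L_k < \sigma) \ge 1-\alpha_2. \qquad (4-6)$$
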


(See,  for  example,  p.  181.  of  [13.) 
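The exact lower confidence bound for a binomial proportion can be obtained numerically. The sketch below assumes the bound is the root in x of the equation stating that the upper tail probability Pr(Bin(N, x) >= B) equals the level, and finds that root by bisection, which is valid because the tail probability increases with x when B > 0. The names `upper_tail` and `exact_lower_bound` are hypothetical.

```python
from math import comb


def upper_tail(x, B, N):
    """Pr(Bin(N, x) >= B): probability of at least B nonblank entries."""
    return sum(comb(N, j) * x**j * (1 - x) ** (N - j) for j in range(B, N + 1))


def exact_lower_bound(B, N, alpha2, iterations=60):
    """Exact (1 - alpha2) lower confidence bound from B successes in N trials.

    Returns 0 when B == 0; otherwise bisects on [0, 1] for the x at which
    the upper tail probability equals alpha2.
    """
    if B == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if upper_tail(mid, B, N) < alpha2:
            lo = mid   # tail still too small: the root lies to the right
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # One nonblank entry on a diagonal of length two: smear .5, tiny bound.
    print(round(exact_lower_bound(1, 2, 0.025), 4))
```

For one nonblank entry on a diagonal of length two, the equation reduces to 1-(1-x)^2 = .025, whose root is about .013. The bound therefore stays far below the diagonal smear of .5, which shows how it accounts for diagonal length.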


What use is to be made of U and $L_k$? The hypothesis that $\sigma = \sigma_0$ is
rejected at level $\alpha_2$ in favour of the hypothesis $\sigma > \sigma_0$ if $L_k$ exceeds $\sigma_0$.
In our context no $\sigma_0$ is given to be tested; a $(1-\alpha_1)$ upper confidence
bound may be set for the theoretical smear. It is then reasonable to
suspect that when

$$U < L_k \qquad (4-7)$$

a string is common to the words X and Y at lag k, and expect that if X
and Y share in common a long string, then inequality (4-7) will hold for
a lag k specified by the position of the string in the two words. Hence
to detect diagonals hosting long common strings, instead of sorting
diagonal smears, we propose to compare $L_k$ to U. Qualitatively, the $\{L_k\}$
relate to U as the $\{D_k\}$ to S; the presence of a long string common to the
two words under examination raises $L_k$ and has little effect on U. The
advantage of $L_k$ (vs. $D_k$) is that the $L_k$ take into account diagonal length
and consequently the variability of $L_k$ is smaller than that of $D_k$ as can
be seen by comparing plots of $D_k$ and $L_k$. Figures 4-7a and 4-7b plot
the 97.5% upper confidence bound U and the 97.5% lower confidence bound
$L_k$ at each lag, for the two selected protein pairs.

The probability of the event at (4-7) is

$$\Pr(U < L_k) = \Pr(\sigma\le U<L_k \text{ or } U<\sigma<L_k \text{ or } U<L_k\le\sigma)$$
$$= \Pr(\sigma\le U<L_k \text{ or } U<\sigma<L_k) + \Pr(U<\sigma<L_k \text{ or } U<L_k\le\sigma) - \Pr(U<\sigma<L_k)$$
$$\le \Pr(\sigma<L_k) + \Pr(U<\sigma) - \Pr(U<\sigma<L_k).$$

In view of (4-4) and (4-6), an upper bound for the probability of the
event in (4-7) is:

$$\Pr(U < L_k) \le \alpha_1 + \alpha_2. \qquad (4-8)$$

When (4-7) holds, we shall say that the diagonal smear is significantly
larger than the matrix smear at level $\alpha_1+\alpha_2$.
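The selection rule of (4-7) can be sketched as follows. The upper bound U is taken as given, since its computation rests on proposition 3-1 and equation (3-4), which are not restated here; the per-diagonal counts (B, N) are assumed to be available for each lag. The function names and the toy counts are hypothetical and are not data from the proteins studied.

```python
from math import comb


def exact_lower_bound(B, N, alpha2, iterations=60):
    """Exact (1 - alpha2) lower bound for a binomial proportion (bisection)."""
    if B == 0:
        return 0.0

    def tail(x):
        # Pr(Bin(N, x) >= B), increasing in x for B > 0.
        return sum(comb(N, j) * x**j * (1 - x) ** (N - j) for j in range(B, N + 1))

    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if tail(mid) < alpha2:
            lo = mid
        else:
            hi = mid
    return lo


def significant_lags(counts, U, alpha2):
    """List (lag, L_k) for every diagonal whose lower bound L_k exceeds U.

    counts maps lag -> (B, N): nonblank entries and diagonal length.
    U is the upper confidence bound obtained from the matrix smear.
    """
    selected = []
    for lag, (B, N) in sorted(counts.items()):
        L_k = exact_lower_bound(B, N, alpha2)
        if U < L_k:
            selected.append((lag, L_k))
    return selected


if __name__ == "__main__":
    toy_counts = {0: (1, 2), 1: (30, 60), 2: (8, 60)}  # illustrative only
    for lag, bound in significant_lags(toy_counts, U=0.16, alpha2=0.025):
        print(lag, round(bound, 3))
```

In the toy input the short diagonal at lag 0 has the same smear (.5) as the long diagonal at lag 1, yet only the latter is selected, because its lower bound is computed from sixty entries rather than two.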

As can be seen from table 4-2, at $\alpha_1=.025$, U equals .16 for the
pair of chorion proteins and .09 for the control pair. The lags at
which diagonal smears are significantly larger than matrix smears for
both protein pairs are listed below.

Table 4-3. Lags at which diagonal smears are significantly higher than
the CC matrix smears of protein pairs (292, 18B) and
(292, Cytochr. c). $\alpha_1=.025$, $\alpha_2=.025$

   292, 18B          292, Cyto c

  LAG    LCB         LAG    LCB

  -12    .39          -      -
    0    .28          -      -
    5    .17          -      -

At $\alpha_1=\alpha_2=.025$, the proposed procedure detects the matrix diagonals
on which the three most prominent strings common to 292 and 18B lie. No
diagonal smears are significantly higher than the matrix smear at these
levels, in the control pair.

To detect more diagonals one should either lower U or raise $L_k$,
i.e. increase either $\alpha_1$ or $\alpha_2$. From table 4-2 it can be seen that (up to
the second decimal point) the asymptotic 95% UCB for $\sigma$ is .16; the lags
of the diagonals at which the diagonal smear is significantly higher than
the matrix smear for $\alpha_1=.05$ and $\alpha_2=.025$ are also given by table 4-3.

Figures 4-7c and 4-7d plot the same bounds of $\sigma$ for $\alpha_1=.025$ and
$\alpha_2=.05$. The lags at which diagonal smears are significantly higher than
matrix smears at these levels are given in table 4-4.

Table 4-4. Lags at which diagonal smears are significantly higher than
the CC matrix smears of protein pairs (292, 18B) and
(292, Cytochr. c). $\alpha_1=.025$, $\alpha_2=.05$

   292, 18B          292, Cyt. c

  LAG    LCB         LAG    LCB

 -100    .17         -72    .09
  -24    .17
  -12    .40
  -10    .16
    0    .29
    5    .18

Besides  the  diagonals  already  detected  in  table  4-3,  the  two  next 
prominent  strings  in  the  BNC1  matrix  of  the  chorion  proteins  are  detected 
in  table  4-4  and  indication  is  given  that  a  string  common  to  both  words 
might  occur  along  the  diagonal  at  lag  -24.  The  BNC1  matrix  of  figure  4-5 
indicates  that  the  longest  common  string  along  this  diagonal  is  the 
tetrapeptide  AVAG.  On  the  other  hand,  in  the  control  pair  and  at  the  same 
levels,  the  diagonal  smear  at  lag  -72  is  significantly  larger  than  the 
matrix  smear  and  the  detection  is  void  of  any  biological  content.  Hence, 
the  control  pair  does  not  allow  us  to  consider  the  tetrapeptide  selected 
for  the  chorion  pair  as  the  realization  of  a  legitimate  signal. 

It was desirable to derive a simultaneous confidence band for $\sigma$ at
each lag. As this has not been attained, a $1-\alpha_1$ upper confidence bound
for $\sigma$ from S and a $1-\alpha_2$ lower confidence bound for the same parameter
from $D_k$ are constructed and the lags of diagonals at which the diagonal
smears are significantly larger than the matrix smear are listed for
further examination. To detect the common substrings in the data the
investigator now focuses on the selected diagonals. A procedure to
automate this detection will be proposed in chapter 5. In this chapter
the detection is carried out by a visual examination along the diagonals
of the BNC1 matrix.

For $X=(X_1,\ldots,X_m)$ and $Y=(Y_1,\ldots,Y_n)$ let $m \ge n$ and $L \ge 0$ without loss
of generality and let $M = \{M_{i,j}\}$ be a character matrix at the disposal of
the investigator. The diagonal of M at lag L aligns the substrings

$$X_1,\ldots,X_{n-L}$$
$$Y_{1+L},\ldots,Y_n. \qquad (4-9)$$

If

$$M_{a,a+L},\ldots,M_{b,b+L} \qquad (4-10)$$

is the prominent substring of mostly non-blank character entries on the
diagonal of M at lag L, then

$$X_a,\ldots,X_b$$
$$Y_{a+L},\ldots,Y_{b+L} \qquad (4-12)$$

are the most similar substrings of X and Y. The substrings of (4-10) can
be thought of as two realizations of a signal in X and Y; the mismatches
between $X_i$ and $Y_{i+L}$ in (4-12) are caused by the imposition of noise on the
signal.

How good is the proposed procedure? There are two types of error that the
procedure may commit and which, following the use of the terms in the
statistical literature, we call type I and type II errors.

A type I error occurs when no signal is present in both words and
the procedure comes up with some diagonal smear significantly larger than
the matrix smear. A type I error may be thought of as a 'false alarm'.

No type I error was committed when the proposed procedure was
applied to the control pair at $\alpha_1=.025$ and $\alpha_2=.025$. A false alarm was
given for the same pair, at $\alpha_1=.025$ and $\alpha_2=.05$; one out of the 242
diagonal smears was significantly larger than the matrix smear. If $\{L_k-U\}$
were I.I.D. and the upper bound of $\Pr(L_k>U)$ in inequality (4-8) was
attained, we would expect that diagonal smears would be significantly
higher than S at approximately 12 and 18 lags for the two sets of
levels chosen. (Because .05*242=12.1 and .075*242=18.1.) The discrepancy
between the observed and the expected is striking and can be attributed
to two factors: $\alpha_1+\alpha_2$ is only an upper bound to $\Pr\{U<L_k\}$ and the $\{L_k-U\}$ are
not independent. False alarms suggesting that very few out of hundreds of
diagonal smears are significantly high (in our case 1 out of 242) are
painless; in requesting the investigator to focus on a few diagonals,
the proposed procedure reduces drastically the volume of work involved in
the visual examination of the data.

A  type  II  error  arises  when  a  string  is  common  to  the  two  words 
but  it  is  not  long  enough  to  cause  the  diagonal  smear  at  the  lag 
specified  by  its  position  in  the  words,  to  become  significantly  larger 
than  the  matrix  smear.  While  the  occurrence  of  a  type  I  error  is  rather 
painless,  a  type  II  error  is  a  serious  one. 

The detection of common strings by comparing U to $L_k$ for each
diagonal was developed while examining chorion proteins 292 and 18B. The
proposed procedure is now applied to the proteins encoded by the
Balbiani ring genes which are denoted by BR1, BR2 and BRC and presented
on figure 4-8a. Figure 4-8b lists the protein products of the Balbiani
ring genes which will be called BR1, BR2 and BRC proteins. Figure 4-9
illustrates the BNC1 character matrix for proteins BR1 and BR2 and
underlines the strings most prominently common to the BR1 and BR2
proteins. The underlined strings lie on matrix diagonals at lags -173,
-105, -91, -31, -23, -9, 51, 59 and 133. The underlined strings suggest
that there are extensive internal repeats within each of BR1 and BR2
proteins; the repeats are illustrated in the BNC1 character matrices for
the proteins on figures 4-10a and 4-10b. For the BR1 and BR2 proteins,
S=.120. Asymptotic two-sided confidence intervals of $\sigma$ at different
levels, computed from proposition 3-1a, are presented in table 4-5 below.

Table 4-5. Two-sided confidence interval for $\sigma$ from the CC
matrix smear of the BR1 and BR2 proteins.

$(1-\alpha_1)$     Confidence Interval

   .90          (.106-.134)
   .95          (.104-.136)
   .99          (.099-.141)

The process of diagonal smears and the 95% asymptotic confidence interval
for $\sigma$ are plotted on figure 4-11. Figure 4-12a plots U and $L_k$ for
$\alpha_1=.025$, $\alpha_2=.01$. As can be seen from table 4-5, the 97.5% UCB for $\sigma$ from
S is .136. The lags at which $\{L_k>U\}$ for $\alpha_1=.025$ and $\alpha_2=.01$ are given in
table 4-6 below.

Table 4-6. Lags at which diagonal smears for the CC matrix of the BR1 and
BR2 proteins are significantly higher than matrix smear. $\alpha_1=.025$, $\alpha_2=.01$.

  LAG    LCB

 -173    .366
 -105    .197
  -91    .201
  -31    .166
  -23    .181
   51    .212




If the strings detected visually and underlined on figure 4-9 are
regarded as nine legitimate signals, table 4-6 indicates that at
$\alpha_1=.025$, $\alpha_2=.01$ the proposed procedure commits no type I errors; it
selects what appear to be the strongest six out of the nine signals on
the BNC1 matrix of figure 4-9. To obtain the remaining signals - and at
the risk of the occurrence of type I errors - one must increase $\alpha_1$ or $\alpha_2$.
For $\alpha_1$ as large as .05, U=.134. At $\alpha_1=.05$, $\alpha_2=.01$, the procedure will
still come up only with the smears of table 4-6 as significant; it is
rather stable for fixed $\alpha_2$. Figure 4-12b plots U and $L_k$ for $\alpha_1=.025$ and
$\alpha_2=.025$. The lags at which diagonal smears are significantly higher than
the matrix smear are given by table 4-7 below.

Table 4-7. Lags at which diagonal smears for the CC matrix of
the BR1 and BR2 proteins are significantly
higher than matrix smear. $\alpha_1=.025$, $\alpha_2=.025$

  LAG    LCB

 -173    .395
 -105    .210
  -91    .214
  -31    .176
  -23    .192
   51    .226
   59    .139
  143    .150
  146    .139
  152    .152

At $\alpha_1=.025$ and $\alpha_2=.025$, four more lags - besides the lags listed
in table 4-6 - are selected: 59, 143, 146 and 152. Of those, the first
one was detected after a visual examination of the BNC1 matrix on figure
4-9. No false alarms are given at the three remaining lags; strings are
common to the BR1 and BR2 proteins, but they were not prominent enough in
their BNC1 character matrix to be picked up in the initial visual
examination of the matrix. Finally the relatively short strings
underlined at diagonals of lags -9 and 133 should be considered as cases
of type II errors when the procedure is operated on the BR1 and BR2
proteins at $\alpha_1=.025$ and $\alpha_2=.025$.

The strings underlined on the diagonals at lags 133 and -9 are too
short to cause the corresponding diagonal smears to be significantly
larger than the matrix smear. However, had protein BR1 been
investigated for internal repeats, the two strings could have been
detected from the strings underlined at lags 51 and -91 (at which, as
shown in tables 4-6 and 4-7, the diagonal smear is significantly higher
than the matrix smear for both choices of $\alpha_1$ and $\alpha_2$) and the type II
errors would have been eliminated.

For notational convenience denote the BR2 and BR1 proteins by X
and Y respectively. The substring underlined on the diagonal at lag 133
aligns the octapeptide

$X_{27} \ldots X_{34}$

to $Y_{160} \ldots Y_{167}$. (4-12)

The substring underlined on the diagonal at lag 51 aligns

$X_{27} \ldots X_{63}$ (4-13)

to $Y_{78} \ldots Y_{114}$.

The most prominent diagonal of the BNC1 matrix for the BR1 protein on
figure 4-10a (except for the trivial diagonal at lag 0) indicates that
the substring $Y_1 \ldots Y_{82}$ is duplicated (exactly, with no mismatches) in
$Y_{83} \ldots Y_{164}$. The repeat unit is partially triplicated in the residues
from $Y_{165}$ onward. With
the understanding that entries in the same column are mostly identical,
the repeat structure of BR1 may be summarized as:

Y1  T4  “•  Y78  "•  Y82 




The table indicates that for both values of $\alpha_2$ the six largest
lower confidence bounds occur at the same diagonals. Eighteen out of the
twenty lags listed for each value of $\alpha_2$ overlap, the two
non-overlapping lags for each value of $\alpha_2$ being the lags for the
twenty-first and twenty-second largest at the other value of $\alpha_2$. The
proposed procedure possesses a desirable stability for $\alpha_2$.


Fig. 4-2. (a) Words X and Y share in common the substring underlined, at
lag k. (b) A substring common to X and Y shows up in the CC matrix of X
and Y as a substring of nonblank matrix entries along the
diagonal of lag k. (c) The diagonal smear plot associated with the CC
matrix of X and Y.


Fig. 4-3a. CC matrix for chorion 292 and cytochrome c proteins.


Fig. 4-3b. BNC1 matrix for chorion 292 and cytochrome c proteins.


Fig. 4-4a. Diagonal smear of CCM plotted vs. lag for chorion proteins
292 and 18B.


Fig. 4-5. BNC1 matrix for proteins 292 and 18B. Prominent common
strings are underlined and the lags of their diagonals are indicated.


Fig. 4-6a. Diagonal smears for CCM of chorion proteins 292 and 18B and
95% asymptotic confidence interval for $\sigma$ at each lag.


Fig. 4-6b. Diagonal smears for CCM of chorion 292 and cytochrome c
proteins and 95% asymptotic confidence interval for $\sigma$ at each lag.


Fig. 4-7b. 97.5% lower confidence bound of $\sigma$ from diagonal smear and 97.5%
upper confidence bound of $\sigma$ from S. Control protein pair.


Fig. 4-7c. 95% lower confidence bound of $\sigma$ from diagonal smear and 97.5%
upper confidence bound of $\sigma$ from S. Chorion proteins 292 and 18B.


Fig. 4-7d. 95% lower confidence bound of $\sigma$ from diagonal smear and 97.5%
upper confidence bound of $\sigma$ from S. Control protein pair.


1  Lines  6  0  CHHftHCTSSS/LlNe 


rGCGG7HG7GCAMTGAGAHHGGC7GA«GC7GA»HHATGTCCTAGAACAMH7GG7MGAT .  >„ 
-aTGCCAG7AAATCCAGATG7AC77CAGC7GC7AhaCCAAGCAG*haaTCCGAACC7hCC 
-wCGuArCTAMACCTACHCCACMCAAACCAMGTMMGuAMTCTHHHCCTAGMCCHGAC^MA 
:CamG7haGGGATC7AAACC7aGACCAGaGAaaCCahGTAhGGGaTC7AmhCC7AGalCa 

1 aGGGC7CCGG7AC7GCam7GACam 1 
LINES  60  CH»« ACTSRS/'LINE 

;tai;cahhcacagcaahcca«gcahgcacagcaagcac«gca»hcctagca»gcatagc^ 

-wCCTmGTmwaCmChGTAhGCChGAhhhhTGCGGTmGTGlwhTGhmGaGmhCTGwhGLml 
;*AAm7G7CC7hCAAAGAATGG7AGAT7CAACAG7AACAG*TCTaC7TG7wCC7CaG77G 
;TmmmCGmhGCahaGCamGCamhCaCaGGamGGCmmGTAmhChGmGCAmhCGAmGCmmGC 
v:aGCA«wCC7AGCAAGCACAG7AmAi:C7AGCAAGCATAGCamhCC7AGTA«hCACAG7» 
-GCCmGhmmuhTGCGGTAGTGCAATGAAG^GamCTGauGlAGGmmhaTGTGGTAGi^hhCm 
TQ27  hlATTCAACAGT AAGAGATGT AL i  i u i AuuT  lau i  iL*l7AAmlCAmLwAaaClaA 
GCa«mCaCaGCaaCCCaaCTA*»hCACaGCAhmCCT»GC«hGCaCaG7AmhCCTAGCamGw 
-* 7 hGC ahaCC7 AGT ALwCACAGT AAGCC AGa Amam TGC^l7AGTGCaaT0amGaGAmC ;  j 
.mGCmGChauaTGTGC7aCaaaCAA7CGTACaTTCaaCaG7aAGaGAaG7mC77GTwaC7 
■:  aGTTGGTAAACCAAGCAAACCAAGCAAACACAC  1 

UPLINES  60  CHARACTESS/LINE 

-.GaGaTGCaaG7G7AamTCaGC7CGaaaACCaaGC7CTCGAaaCaCTAG7hmCCaChC7C 
saccaaagaccagtaoacacagtggaccaaacaccagtaaacacagtggaccaaaaacaa 
;.3aaaCaCAC7GCaCCaaACACCAG7AACCACAG7CGACC7AaaCGChCAmanCCmGaaa 
“ATGCGGTAGTGCAATSAaGAGAACTGAACCTGAAAAhTGTGCCAhGAaGAaCGGaACaT 
■"CahCAGTAhGaGATGCAAGTGTACATCAGCTGGamhhCCAhGGTCTCGaaAGhCThGT  a 
-GCACa£7CCACCAAAGaCCAG7AAACACAG7CCACCAaaGACCAG7AaaCaChG7GGaC 
;«wAahCAhCCh«aCaCaC7GCACCAaAGACCAG7AaGCACAG7GGACC7hahCCChCma 
-hCCaGaaAahTGCGGTaGTGGAATCAaGaGaAGTGaaGGTGaaaahTGTGGCaaGaaGa 
-CGGamGaT7CAACAG7AmGAGaTGCaAG7G7ACATCACC7GGAaAhCCAaGC7C7CCaa 
->GmC7AG7amCCACaG7GGACCAaaGACCAG7AAACACAG7GGACCA»aGaCCaG7ahmC 
-CaG7GGaCCamaaaCaaGCammCACAG7GGaCCahhCaCCaG7aaGCaCaC7GGaCC7m 
-m03CaGaaamCCaCAaAaaTGCSG7AC7GCAaTGAhGaGaaC7GAAi;C7Gaaaaa7G7G 
I IaaGAhGaaCCGaaGA77CaACACTA«CACaTCCAhG7C7ACa7CaGC7GGaaaaCCah 
3CTC7CCaaaGaC7AG7AaGCACAG7GGACCaaaGwCCaC7aaaChCAG7GGaCCAaaCa 
I IhG7aaaCaCaG7GGaCCamaaaCAaGCAaaCaCaGT  GGaClaaaGaClaG i aaG whCh 
itggacctaaacccacawaaccacaahkatccggtagtgcaatgaagacaactghhcctg 
-«hhhTGTGCCamGaaGahCGGAmGaTTCahCAG7aaGmGh7GCaaG7G7GCATCaGCTG 
GammACCAAGCAAATCTGAGCCAAGAACTGAGCGTCCTACTACaTGCATTGaaTCT  agtg 
-aaGCCaTGaG7CGaC7CCAAC7GaaGGACC7hCaaC77G7G7CGaaTC7aG7Gam«GCC 
-.TGaGTCAAC7T7AaC7GAaGGACC7ACAAC77G7G77GAT7C7aG7GaGAG7a»GGAGa 
:‘-CCaCaaCCaGC7AT77GCGaTGG7GaaaTCACAG7TAaaCaaTC7ahaaaGTG7Ga7C 
-LArrtGGCGGCAAG77TArtTCC7GATAAC7G7AAATGCAC7AA«GAACC7GT7ACGGAGG 
I<-iCGahCaaC77GTATCGaa7CaaGGGAACGaA7GaGGaaG77  ACCACaam£aGC7  t 


Fig. 4-8a. DNA sequences for Balbiani ring genes BR1, BR2, BRC.


:  -*NRK«EAEKCARRNGRFNrtSKCRC7SAGKPSRKSE.oSK'G3KP9PS!<PSKESi^PRPEKP'5KG3KP0P 
I3ApPP5GCG3AMRi<A£M£i<CAKfiNG*FNH!sKCRCT3AGkp-SSK;»EP:»kG3kpKPE.kPSAS3kPftPE 
:  /PRPEKPSKGSKPRPEGCGSANR 

»_  m  s  1  .  j 

.  r*  .-i  Sr*  WSKWSkP'iKrHSKPSKHSKPSKwMaSrtFlk'PTSHrtK.C.^KkNCPFNSk PCTCT3 vO^PSkPSk 

r  KPSk'HSKPSKHSKPSKHSKPSKHSkPSXCCSAPIKPTSA^KCAftKNGPP^aKPCTCT^vaK'PSkP 
rr.HSAPSkHSKPSkHSi<P3KHSKPEKCG3Hf1KRTEAHKGAKKNGRFNSKR-STCNSVG»sPSNpSKrt 

•  :  <'';X':AGXP'$rPKT$KHSGP,<7SKHSGPKTSKHSGP<7Sk'H'?GP,«,7$KHSGPf<'PTKPSkC33Af>'i.'P 
.  -c. XNGPPNSKRGXCTSAGkPSSPXTSkHSGPXT.skHSGPk  7 okHSGPXTSk.HSGPk  rSkMSGPkPT 
1 1  -nKfiTEAEk  CAKKNGPPNSkPCXC7SAGKPSSPXT SKHSGP*  T SKHSGPK7SKHSGPkTSkHSGP 
:  G?kPTkPEKCG3A(lKRTEAEi<CAKKNGaFNSKRCKC73AGkPSSPt<7SKHSGP'<7SkHS<;Pk7SkW 
: A3GP<7SKHSGPkPTKPEKCG3AWKR7EAEKCAKKNCRFN3KRCKCASAGKPSKSEPRTERPT7C 
i-eS7P7EGP77CVS33E3HES7LTEGP77CV033ESkE7PE?AIC0GcNRVkeSkk:CGE:GGkFNP 
£?v-EGPt7C:ES3ESMRKLPGRa 


Fig. 4-8b. BR1, BR2, BRC proteins.




Fig. 4-12b. 97.5% lower confidence bound of $\sigma$ from diagonal smear and
97.5% upper confidence bound of $\sigma$ from S for BR2 and BR1 proteins.




5.  AUTOMATED  DETECTION  OF  SIGNALS  WITHIN  TWO  WORDS. 

Chapter  4  introduced  a  procedure  which  automates  partially  the 
visual  examination  of  character  matrices  of  two  words  by  focusing  on  the 
matrix  diagonals  for  which  the  diagonal  smear  is  significantly  higher 
than  the  matrix  smear.  The  detection  of  the  common  strings  lying  on  the 
diagonals  selected  was  carried  out  visually  in  chapter  4.  This  chapter 
proposes  another  procedure  to  automate  the  identification  of  the  string 
most  prominently  shared  in  common  by  the  two  words  under  comparison. 
When a string is referred to as shared in common by words X and Y,
it is understood that a substring of X is identical to a
substring of Y except for a few occasional mismatches.

The  proposed  procedure  is  applied  on  matches  and  mismatches  between 
substrings  of  the  two  words  under  comparison  at  all  possible  lags  without 
taking  into  account  the  nature  of  the  particular  matches  and  mismatches. 
A  procedure  assigning  weights  to  the  latter  is  presumably  more  powerful 
than  the  one  proposed  here. 

Suppose that the diagonal at lag L in the CC matrix of words
$X=(X_1,\ldots,X_m)$ and $Y=(Y_1,\ldots,Y_n)$ is of length N. For notational
convenience we denote the diagonal entries as

$$Z_1,\ldots,Z_N. \qquad (5\text{-}1)$$

For the diagonal at lag L, the entries $Z_i$ are defined by equation (3-6).
In this chapter nonblank and blank matrix entries will be denoted by 1
and 0 and will also be called successes or matches and errors or
mismatches, respectively. Independence between and within X and Y is
assumed throughout the chapter. Under this assumption the $Z_i$ are i.i.d.
and the probability that $Z_i$ is a nonblank character equals the
theoretical smear of equation (2-3), which in this chapter is denoted by
p (instead of the symbol used in chapters 2, 3 and 4). A string shared in
common by X and Y will be called a signal. A signal at lag L will show up
as a substring of (5-1) with a few occasional errors. Since only a few
occasional mismatches are allowed in the realizations of the signal in
the two words, detecting a signal common to X and Y at lag L can be
thought of as detecting a substring of $Z_1,\ldots,Z_N$ such that the
probability of a success within the substring is higher than the
probability of a success outside it.
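The reduction of a diagonal to a 0/1 sequence can be sketched as follows. This is only an illustration: the function name `diagonal_matches` and the lag convention (for a nonnegative lag, `x[k + lag]` is compared with `y[k]`) are assumptions, since equation (3-6) is not reproduced in this chapter.

```python
def diagonal_matches(x, y, lag):
    """Return the 0/1 match sequence Z_1..Z_N along one diagonal.

    Assumed convention (not taken from the report): for lag >= 0 the
    diagonal compares x[k + lag] with y[k]; for lag < 0 it compares
    x[k] with y[k - lag]. zip() stops at the shorter word, which gives
    the diagonal length N.
    """
    if lag >= 0:
        pairs = zip(x[lag:], y)
    else:
        pairs = zip(x, y[-lag:])
    return [1 if a == b else 0 for a, b in pairs]


# Two short words that differ in a single position.
z = diagonal_matches("MSTFAFLFLC", "MSTFAFLLLC", 0)
print(z, sum(z) / len(z))
```

Only the pattern of matches and mismatches is kept; the identity of the matching characters is discarded, as stated at the start of the chapter.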

The procedure proposed for the detection of the signal depends on
two parameters $p_0$ and $p_1$, $p_0 \le p_1$. $p_0$ is the probability of a
success in the absence of a signal. $p_1$ is a lower bound for the
probability of a success in the signal. $p_0$ and $p_1$ are specified by the
investigator. It is sensible to take $p_0$ to lie within the conventional
confidence intervals for the theoretical smear computed from proposition
3-1. $p_1$ should be close to 1; the smaller the $p_1$, the larger the
probability of a mismatch allowed by the investigator. It is desirable
that the results of the procedure do not depend crucially on the choice
of $p_0$ and $p_1$.

Suppose that $1 \le i \le j \le N$ and let $L_{ij}(p_0,p_1)$ be the generalized
log-likelihood ratio (GLLR) for the hypothesis testing problem

$$H_0: p = p_0 \quad \text{vs.} \quad H_A: p \ge p_1 \qquad (5\text{-}2)$$

based on the substring

$$Z_i,\ldots,Z_j. \qquad (5\text{-}3)$$

Let $s_1$ and $s_0$ be the number of successes and the number of mismatches
and $\hat p = s_1/(s_0+s_1)$ be the fraction of matches in the substring (5-3).
$s_1$, $s_0$ and $\hat p$ depend on i and j; the dependence is not indicated to
avoid making subsequent expressions cumbersome. The generalized likelihood
ratio (GLR) for the testing problem (5-2) based on the substring in (5-3)
is

$$\frac{\sup_{p \ge p_1} p^{s_1}(1-p)^{s_0}}{p_0^{s_1}(1-p_0)^{s_0}}. \qquad (5\text{-}4)$$


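It can be easily verified that the function $s_1\log p + s_0\log(1-p)$
attains its maximum at $\hat p$, increases in $[0,\hat p]$ and decreases in
$[\hat p,1]$. Consequently,

$$\sup_{p \ge p_1} p^{s_1}(1-p)^{s_0} =
\begin{cases}
\hat p^{\,s_1}(1-\hat p)^{s_0} & \text{if } \hat p \ge p_1 \\
p_1^{s_1}(1-p_1)^{s_0} & \text{if } \hat p \le p_1
\end{cases} \qquad (5\text{-}5)$$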


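and the GLLR for the hypothesis testing problem in (5-2) from the data in
(5-3) is:

$$L_{ij}(p_0,p_1) =
\begin{cases}
s_1 \log\dfrac{\hat p}{p_0} + s_0 \log\dfrac{1-\hat p}{1-p_0} & \text{if } \hat p \ge p_1 \\[2ex]
s_1 \log\dfrac{p_1}{p_0} + s_0 \log\dfrac{1-p_1}{1-p_0} & \text{if } \hat p \le p_1
\end{cases} \qquad (5\text{-}6)$$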
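A minimal sketch of the GLLR of (5-6) is given below to make the two cases concrete. The function name `gllr` and its argument order are illustrative choices, and natural logarithms are assumed; it also assumes $0 < p_0 \le p_1 < 1$ and a nonempty substring.

```python
import math


def gllr(s1, s0, p0, p1):
    """GLLR of (5-6) for a substring with s1 matches and s0 mismatches."""
    p_hat = s1 / (s1 + s0)
    # The likelihood is maximised over p >= p1 at p_hat when p_hat >= p1
    # and at p1 otherwise, so both cases of (5-6) use max(p_hat, p1).
    p_star = max(p_hat, p1)
    match_term = s1 * math.log(p_star / p0) if s1 else 0.0
    mismatch_term = s0 * math.log((1 - p_star) / (1 - p0)) if s0 else 0.0
    return match_term + mismatch_term


# Seven matches and no mismatch with p0 = .10: the GLLR is 7 * log(10).
print(round(gllr(7, 0, 0.10, 0.95), 2))
# Thirteen matches and two mismatches: p_hat = .87 < p1 = .90, so the
# second case of (5-6) applies.
print(round(gllr(13, 2, 0.10, 0.90), 2))
```

When all entries of the substring are matches, the value reduces to $s_1\log(1/p_0)$ and does not depend on $p_1$.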


For the specified $p_0$ and $p_1$, the proposed procedure finds the
substring of (5-1) which maximizes the GLLR (5-6) over all the substrings
on the diagonal (5-1). If the maximum GLLR exceeds a critical value, which
depends on N, $p_0$ and $p_1$ and will be determined by simulation later in
this chapter, the procedure detects a signal common to words X and Y,
which shows up as the substring maximizing the GLLR. Formally, if

$$M(p_0,p_1) = \max_{1 \le i \le j \le N} L_{ij}(p_0,p_1), \qquad (5\text{-}7)$$

$$L_{IJ}(p_0,p_1) = M(p_0,p_1), \qquad (5\text{-}8)$$

and $M(p_0,p_1)$ is greater than a critical value to be elaborated upon
later, the proposed procedure detects the substring

$$Z_I,\ldots,Z_J \qquad (5\text{-}9)$$

to be a signal allowing for error with probability less than $1-p_1$,
immersed in noise where the probability of a match equals $p_0$. We shall
say that the signal is realized in the data as the pair of substrings of
X and Y that are compared along (5-9).
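The search defined by (5-7) and (5-8) can be sketched as an exhaustive scan over all substrings. The names `gllr` and `max_gllr_substring` are illustrative, ties are broken in favour of the first substring found (the report does not say how ties are handled), and the quadratic scan is simply the most direct reading of the definition.

```python
import math


def gllr(s1, s0, p0, p1):
    """GLLR of (5-6); assumes 0 < p0 <= p1 < 1 and s1 + s0 > 0."""
    p_star = max(s1 / (s1 + s0), p1)
    match_term = s1 * math.log(p_star / p0) if s1 else 0.0
    mismatch_term = s0 * math.log((1 - p_star) / (1 - p0)) if s0 else 0.0
    return match_term + mismatch_term


def max_gllr_substring(z, p0, p1):
    """Return (M, I, J): the maximum GLLR and 1-based inclusive end points."""
    prefix = [0]
    for bit in z:
        prefix.append(prefix[-1] + bit)  # prefix[k] = matches in Z_1..Z_k
    best = (-math.inf, None, None)
    n = len(z)
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            s1 = prefix[j] - prefix[i - 1]
            s0 = (j - i + 1) - s1
            value = gllr(s1, s0, p0, p1)
            if value > best[0]:
                best = (value, i, j)
    return best


# A run of matches with one mismatch, surrounded by mismatches.
z = [0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
m, i, j = max_gllr_substring(z, 0.15, 0.90)
print(round(m, 2), i, j)
```

In this example the substring from site 3 to site 9 is selected: the single mismatch is tolerated because the two extra matches to its right add more to the GLLR than the mismatch removes.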

The proposed procedure can be considered as a modified GLR for
testing the hypothesis that $Z_1,\ldots,Z_N$ is a noisy string vs. the
hypothesis that somewhere in the string there exists a signal of success
probability higher than $p_1$. The relation between the two is examined in
Appendix 1.

Remark that if for the substring in (5-9), $p_1 \le p_2 \le \hat p$, then

$$\sup_{p \ge p_1} p^{s_1}(1-p)^{s_0} = \sup_{p \ge p_2} p^{s_1}(1-p)^{s_0},$$

and therefore $L_{IJ}(p_0,p_1) = L_{IJ}(p_0,p_2)$; the same substring will
maximize the GLLR for the choices $(p_0,p_1)$ and $(p_0,p_2)$.

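When $p_1 = p_0$, $L_{ij}(\cdot,\cdot)$ reduces to:

$$L_{ij}(p_0,p_0) =
\begin{cases}
s_1 \log\dfrac{\hat p}{p_0} + s_0 \log\dfrac{1-\hat p}{1-p_0} & \text{if } \hat p \ge p_0 \\[2ex]
0 & \text{if } \hat p \le p_0
\end{cases}$$

which is well known to be the GLLR test statistic for the hypothesis

$$H_0: p = p_0 \quad \text{vs.} \quad H_A: p > p_0.$$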


The proposed procedure was applied to the diagonals at lags 5, 0
and -12 of the CC matrix of chorion proteins 292 and 18B. The diagonals
were listed in table 4-3; at the .025 level, their diagonal smear
was significantly higher than the matrix smears for the chorion 292 and
18B proteins. $p_0$ was chosen at .10 and .17, the endpoints of the 99%
confidence interval for the matrix smear given in table 4-2. For each
$p_0$, $p_1$ was selected at $p_0$ and .70, .80, .90, .95.

Table 5-1 presents the substrings of proteins 292 and 18B that
maximize the GLLR for the various choices of $p_0$ and $p_1$. In the
discussion pertaining to tables 5-1, 5-2 and 5-3 the substrings of table
5-1 will be called signals; the critical values that the maximum GLLR
will have to exceed for the substrings to be legitimately considered
realizations of signals in the data will be elaborated upon later. The
detection of the signal in the data depends on the choice of $p_0$ and
$p_1$, but the same pair of substrings may maximize the GLLR for two
different choices of $(p_0,p_1)$. Next to each pair of similar substrings
of table 5-1 a 2 by 5 matrix of characters 0 and 1 is typed. The indices
of the matrix elements correspond to the 2 by 5 choices for $(p_0,p_1)$
and a matrix element is 1 if the substring listed maximizes the GLLR for
the values of $(p_0,p_1)$ specified by the indices of the matrix element.




Table 5-1. The substrings of chorion 292 and 18B proteins maximizing the
GLLR of (5-6) for the 2 by 5 choices for (p_0,p_1). "1"s
indicate the values of (p_0,p_1) for which the listed substrings
maximize the GLLR.

LAG

      GGLGYEG                                                        11111
  5   GGLGYEG                                                        11111

      The first 114 amino acids of both proteins                     10000
  0                                                                  00000

      MSTFAFLFLCIQACLVQNVFGVCKGGLGLKGLAAPACGCGGLGYEGLGT              01000
  0   MSTFAFLLLCAQACLIQSVYSYGCGCGCGGLGGYGGLGYGGLGYGGLGY              00000

      MSTFAFLFLCIQACLVQNV                                            00100
  0   MSTFAFLLLCAQACLIQSV                                            11100

      MSTFAFLFLCIQACL                                                00011
  0   MSTFAFLLLCAQACL                                                00011

      GSYGGEGIGNVAVAGELPVAGTTAVAGQVPIIGAVDFCGRANAGGCVSIGGRCTGCGCGCG  11110
-12   GEYGGTGIGNVAVAGELPVAGKTAVGGQVPIIGAVGFGGTAGAAGCVSIAGRCGGCGCGCG  11100

      YGGEGIGNVAVAGELPVAGTTAVAGQVPIIGAVDFCGRANAGGCVSIGGRCTGCGCGCG    00001
-12   YGGTGIGNVAVAGELPVAGKTAVGGQVPIIGAVGFGGTAGAAGCVSIAGRCGGCGCGCG    00011


Table 5-2 presents $\hat p$, the ratio of matches for the substrings of
table 5-1, and table 5-3 lists the lengths of the diagonals and the values
of $M(p_0,p_1)$ for the substrings of table 5-1.

Table 5-2. Ratio of matches for the substrings of table 5-1.

                              p_1
 LAG    p_0        p_0     .70     .80     .90     .95

   5    .10        1.      1.      1.      1.      1.
        .17        1.      1.      1.      1.      1.

   0    .10        .38     .53     .80     .87     .87
        .17        .80     .80     .80     .87     .87

 -12    .10        .82     .82     .82     .82     .83
        .17        .82     .82     .82     .83     .83




Table 5-3. Values of M(p_0,p_1) attained for the 2 by 5
possible values of (p_0,p_1).

        DIAGONAL                          p_1
 LAG    LENGTH     p_0        p_0      .70      .80      .90      .95

   5     116       .10       16.12    16.12    16.12    16.12    16.12
                   .17       12.40    12.40    12.40    12.40    12.40

   0     121       .10       30.95    25.33    25.18    24.17    23.49
                   .17       17.55    17.55    17.54    17.43    16.75

 -12     121       .10       87.50    87.50    87.50    85.69    81.41
                   .17       61.86    61.86    61.86    60.50    56.22


Notice that for fixed $p_0$, as $p_1$ increases (i.e., as the procedure
allows for a smaller probability of error in the signal) the substrings
maximizing the GLLR are shorter and have a higher ratio of matches.

At lag 5 the heptapeptide GGLGYEG maximizes the GLLR of equation
(5-6) for all selected values of $(p_0,p_1)$. Depending on the values of
$p_0$ and $p_1$, two different substrings on the diagonal at lag -12
maximize the GLLR. The signal detected for $(p_0,p_1)$ = (.10,.95),
(.17,.90) or (.17,.95) deletes from the longer signal, detected for the
remaining seven values of $(p_0,p_1)$, its starting dipeptide, which
contains one mismatch. On the diagonal at lag 0 four different substrings
maximize the GLLR for the ten choices of $p_0$ and $p_1$. A visual
examination of the substring detected for $p_0$=.10 and $p_1$=.70 reveals
that MSTFAFL*LC*QACL and GGLGY*GLGY are present on its left and right
ends. (The occasional errors in the common string are denoted by an
asterisk.) The substring on the left maximizes the GLLR when small
probabilities of error ($p_1$=.90 or .95) are allowed. Noise intervenes
between the two strings. When large probabilities of error are allowed
($p_0$=.10 and $p_1$=.70) the matches on the right and left compensate
for the noise in the middle, and the two signals together with the
intervening noise maximize the GLLR. Similarly, a few matches to the
right of the string GGLGY*GLGY on the diagonal at lag 0 cause the
substrings consisting of the first 114 amino acids of proteins 292 and
18B to maximize the GLLR for p=.10 vs. p>.10.

The  BNC1  matrix  for  chorion  proteins  292  and  18B  was  presented  in 
figure  4-5;  its  visual  examination  was  conducted  prior  to  and 
independently  of  the  application  of  the  procedure  proposed  in  the  present 
chapter.  Table  5-4  summarizes  the  results  of  the  visual  examination  along 
lags 5, 0 and -12. When a prominent substring along the diagonal of the
BNC1 matrix begins or ends with a '•', the realizations of the signal
are taken to start or end one character to the left or right of

Table 5-4. Strings shared in common by chorion 292 and 18B proteins
recognized visually at the diagonals of table 4-3.

LAG
      GGLGYEGLG
  5   GGLGYEGTG

      MSTFAFLFLCIQACLVQ and GGLGYEGLGY
  0   MSTFAFLLLCAQACLIQ and GGLGYGGLGY

      GSYGGEGIGNVAVAGELPVAGTTAVAGQVPIIGAVDFCGRANAGGCVSIGGRCTGCGCGCG
-12   GEYGGTGIGNVAVAGELPVAGKTAVGGQVPIIGAVGFGGTAGAAGCVSIAGRCGGCGCGCG

The visual examination of the BNC1 matrix of proteins 292 and 18B
detects common substrings that are selected by the proposed procedure for
the 2 by 5 choices for $(p_0,p_1)$. The advantage of the proposed procedure
is that it automates and quantifies the detection process.

The application of the proposed procedure may result in the two
types of error that were referred to in chapter 4. Related to the
detection of a substring common to two words is the problem of the
detection of a string of successes (up to a few occasional mismatches) in
the word of (5-1). The latter will be called the one-dimensional problem,
to be contrasted with the former, two-dimensional problem. The two types
of error are investigated for the one-dimensional problem first.

The asymptotic distribution of the GLLR $L_{ij}(p_0,p_1)$ of (5-6) for the
hypothesis testing problem in (5-2) has been derived in [2]. The .95
quantiles of the distribution of M(.,.) are estimated by simulation.
100 binary strings of length 50, 100, 200 and 300 were randomly
generated with probabilities of success $\pi$ = .10, .15 and .20. For each
string M(.,.) was computed for $p_0$ = .10, .15, .20 and $p_1$ = $p_0$, .70,
.80, .90, .95. To facilitate the presentation of the simulation results,
the estimate of the .95 quantile of the distribution of $M(p_0,p_1)$ for
noisy strings of length L generated with probability of success $\pi$ is
denoted by $Q_L(p_0,p_1|\pi)$. Tables 5-5 to 5-8 present the estimates
$Q_L(\cdot,\cdot|\cdot)$ for each combination of string length L,
probability of success in the binary strings $\pi$, and the nominal
parameters $p_0$ and $p_1$. The distribution of M(.,.) is discrete. The .95
quantiles are estimated by the midpoint between the ninety-fifth and
ninety-sixth largest observations for each combination of parameters.
Next to $Q_L(\cdot,\cdot|\cdot)$, the largest observation from the 100 runs
is listed in parentheses in tables 5-5 to 5-8.
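The simulation just described can be sketched as follows. All names (`gllr`, `max_gllr`, `estimate_q95`) and the seed are illustrative; the sketch assumes that the midpoint is taken between the 95th and 96th observations in ascending order, which is one reading of the rule stated above, and it will not reproduce the tabulated values exactly because the report's random strings are not available.

```python
import math
import random


def gllr(s1, s0, p0, p1):
    """GLLR of (5-6); assumes 0 < p0 <= p1 < 1 and s1 + s0 > 0."""
    p_star = max(s1 / (s1 + s0), p1)
    match_term = s1 * math.log(p_star / p0) if s1 else 0.0
    mismatch_term = s0 * math.log((1 - p_star) / (1 - p0)) if s0 else 0.0
    return match_term + mismatch_term


def max_gllr(z, p0, p1):
    """M(p0, p1): the largest GLLR over all substrings of z."""
    best = -math.inf
    for i in range(len(z)):
        s1 = 0
        for j in range(i, len(z)):
            s1 += z[j]  # running count of matches in z[i..j]
            best = max(best, gllr(s1, j - i + 1 - s1, p0, p1))
    return best


def estimate_q95(length, pi, p0, p1, runs=100, seed=0):
    """Return (estimated .95 quantile of M, largest observed M)."""
    rng = random.Random(seed)
    maxima = sorted(
        max_gllr([int(rng.random() < pi) for _ in range(length)], p0, p1)
        for _ in range(runs)
    )
    k = (95 * runs) // 100  # 95 when runs = 100
    # Midpoint between the k-th and (k+1)-th ordered observations.
    return 0.5 * (maxima[k - 1] + maxima[k]), maxima[-1]


q, largest = estimate_q95(length=50, pi=0.10, p0=0.10, p1=0.95)
print(round(q, 2), round(largest, 2))
```

Because M(.,.) takes only a few distinct values for short strings, repeated runs of such a simulation tend to return the same handful of numbers, which is consistent with the many repeated entries in tables 5-5 to 5-8.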

Table 5-5. Upper 5% points for M(.,.) estimated from 100 binary strings
of length 50 generated with success probability π.
The largest observations in the 100 runs are listed in
parentheses.

                                        p_1
  π     p_0       p_0          .70          .80          .90          .95

 .10    .10    6.91(9.21)   6.91(9.21)   6.91(9.21)   6.91(9.21)   6.91(9.21)
        .15    5.69(7.59)   5.69(7.59)   5.69(7.59)   5.69(7.59)   5.69(7.59)
        .20    4.83(6.44)   4.83(6.44)   4.83(6.44)   4.83(6.44)   4.83(6.44)

 .15    .10    7.82(12.1)   7.52(12.0)   6.97(11.5)   6.91(11.5)   6.91(11.5)
        .15    6.40(9.49)   5.69(9.49)   5.69(9.49)   5.69(9.49)   5.69(9.49)
        .20    4.83(8.05)   4.83(8.05)   4.83(8.05)   4.83(8.05)   4.83(8.05)

 .20    .10    11.7(13.8)   9.23(13.8)   9.21(13.8)   9.00(13.8)   8.79(13.8)
        .15    7.59(11.4)   7.59(11.4)   7.26(11.4)   7.20(11.4)   7.00(11.4)
        .20    6.08(9.66)   6.00(9.66)   6.00(9.66)   5.94(9.66)   5.73(9.66)


Table 5-6. Upper 5% points for M(.,.) estimated from 100 binary
strings of length 100 generated with success probability π.
The largest observations in the 100 runs are listed in
parentheses.

                                        p_1
  π     p_0       p_0          .70          .80          .90          .95

 .10    .10    7.22(9.21)   7.22(9.21)   7.15(9.21)   6.91(9.21)   6.91(9.21)
        .15    5.69(7.59)   5.69(7.59)   5.69(7.59)   5.69(7.59)   5.69(7.59)
        .20    4.83(6.44)   4.83(6.44)   4.83(6.44)   4.83(6.44)   4.83(6.44)

 .15    .10    9.21(10.9)   9.21(10.1)   9.21(9.21)   9.21(9.21)   9.21(9.21)
        .15    7.59(7.59)   7.59(7.59)   7.59(7.59)   7.59(7.59)   7.59(7.59)
        .20    6.44(6.44)   6.44(6.44)   6.44(6.44)   6.44(6.44)   6.44(6.44)

 .20    .10    13.1(17.2)   11.4(12.7)   10.0(11.5)   9.21(11.5)   9.21(11.5)
        .15    9.34(10.3)   7.86(9.49)   7.59(9.49)   7.59(9.49)   7.59(9.49)
        .20    6.56(8.05)   6.44(8.05)   6.44(8.05)   6.44(8.05)   6.44(8.05)


Table 5-7. Upper 5% points for M(.,.) estimated from 100 binary
strings of length 200 generated with success probability π.
The largest observations in the 100 runs are listed in
parentheses.

                                        p_1
  π     p_0       p_0          .70          .80          .90          .95

 .10    .10    6.91(9.21)   6.91(9.21)   6.91(9.21)   6.91(9.21)   6.91(9.21)
        .15    5.69(7.59)   5.69(7.59)   5.69(7.59)   5.69(7.59)   5.69(7.59)
        .20    4.83(6.44)   4.83(6.44)   4.83(6.44)   4.83(6.44)   4.83(6.44)

 .15    .10    10.5(15.4)   9.21(15.4)   9.06(15.4)   9.00(15.4)   8.79(15.1)
        .15    7.26(12.2)   7.26(12.2)   7.26(12.2)   7.20(12.2)   7.00(11.9)
        .20    6.00(9.96)   6.00(9.96)   6.00(9.96)   5.94(9.95)   5.73(9.69)

 .20    .10    18.2(28.1)   11.5(16.1)   11.5(16.1)   11.0(16.1)   10.3(16.1)
        .15    9.49(13.3)   9.16(13.3)   9.15(13.3)   8.48(13.3)   7.91(13.3)
        .20    7.33(11.3)   7.31(11.3)   6.97(11.3)   6.69(11.3)   6.51(11.3)



Table 5-8. Upper 5% points for M(.,.) estimated from 100 binary
strings of length 300 generated with success probability π.
The largest observations in the 100 runs are listed in
parentheses.

                                        p_1
  π     p_0       p_0          .70          .80          .90          .95

 .10    .10    8.98(11.1)   8.24(11.1)   8.06(11.1)   8.06(11.1)   8.06(10.6)
        .15    6.64(8.76)   6.64(8.76)   6.64(8.76)   6.64(8.76)   6.64(8.76)
        .20    5.63(7.01)   5.63(7.01)   5.63(7.01)   5.63(6.95)   5.63(6.58)

 .15    .10    12.2(16.0)   9.81(11.6)   9.40(11.5)   9.21(11.0)   9.21(10.6)
        .15    7.59(7.59)   7.59(7.59)   7.59(7.59)   7.59(7.59)   7.59(7.59)
        .20    6.44(7.01)   6.44(7.01)   6.44(7.01)   6.44(6.95)   6.44(6.58)

 .20    .10    24.5(26.6)   13.7(15.3)   11.5(13.8)   11.5(13.8)   11.5(13.8)
        .15    10.3(13.3)   9.49(11.4)   9.49(11.4)   9.49(11.4)   9.49(11.4)
        .20    8.05(9.66)   8.05(9.66)   8.05(9.66)   8.05(9.66)   8.05(9.66)


Tables 5-5 to 5-8 indicate that when $p_0 \ge \pi$ the estimate
$Q_L(p_0,p_1|\pi)$ is stable over choices of $p_1$ at all string lengths. It
is expected that as total string length increases, the quantiles of the
distribution of M(.,.) increase, the increase being expected to be more
noticeable for shorter strings. This holds in all the 144 comparisons of
estimates of quantiles in tables 5-5 to 5-8 with thirteen exceptions. In
particular,

$$Q_{100}(.10,p_1|.10) > Q_{200}(.10,p_1|.10) \quad \text{for } p_1 = .10, .70, .80,$$

$$Q_{100}(.15,p_1|.15) > Q_{200}(.15,p_1|.15) \quad \text{for } p_1 = .15, .70, .80, .90, .95$$

and

$$Q_{100}(.20,p_1|.15) > Q_{200}(.20,p_1|.15) \quad \text{for } p_1 = .20, .70, .80, .90, .95.$$

An examination of the simulation data of length 200 indicates that for
all the above values of $\pi$, $p_0$ and $p_1$ either the ninety-fifth or the
ninety-sixth largest simulated observations equal $Q_{100}(p_0,p_1|\pi)$.
The discrepancy is minor and does not deserve further attention.

Note that when $\pi > p_0$, $Q_L(p_0,p_1|\pi) > Q_L(p_0,p_1|p_0)$. As a result,
when $Q_L(p_0,p_1|p_0)$ is used as a critical threshold and noisy data are
produced with success probability $\pi > p_0$, the probability of a type I
error (false alarm) becomes considerably higher. For example, for noisy
strings of length 50 generated with success probabilities .15 and .20 the
probabilities $\Pr\{M(.10,.95) \ge Q_{50}(.10,.95|.10)\}$ are estimated to be
.15 and .30. For noisy strings of length 200 and success probabilities .15
and .20, $\Pr\{M(.10,.95) \ge Q_{200}(.10,.95|.10)\}$ is estimated to be .43
and .67 respectively.

The complete statistical assessment of the procedure requires the
investigation of the probabilities of 'false alarm' in conjunction with
that of no detection of a signal present in the data. To this purpose, 50
strings of length L = 50, 100, 200 and 300 were generated. The strings
consisted of signals of lengths S = 5, 7, 9, 11, 13 and 15 of
probability of success $\sigma$ = .90 implanted into noise of success
probability $\pi$ = .15. In the remainder of the chapter signals and noise
will be understood to be Bernoulli variables with success probabilities
$\sigma$ = .90 and $\pi$ = .15 respectively. Signals were implanted at one
tenth and half of the noisy string length. Values used as critical
thresholds will be explained further on. If for some run M(.,.) exceeds
the critical threshold value, the substring for which $M(p_0,p_1)$ is
attained is detected by the procedure.
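One way to generate such test strings is sketched below. The function name `implant_signal` and its parameters are illustrative, and two details are assumptions rather than statements of the report: the signal overwrites a block of sites of the overall string, and sites are numbered from 1 so that a signal of length 7 implanted at the first tenth of a 50-character string occupies sites 5 to 11.

```python
import random


def implant_signal(total_length, signal_length, first_site,
                   pi=0.15, sigma=0.90, rng=None):
    """Return a 0/1 string of noise with a signal implanted in it.

    Noise sites are Bernoulli(pi); the signal occupies the 1-based sites
    first_site .. first_site + signal_length - 1 and is Bernoulli(sigma).
    """
    rng = rng or random.Random()
    z = [1 if rng.random() < pi else 0 for _ in range(total_length)]
    for site in range(first_site, first_site + signal_length):
        z[site - 1] = 1 if rng.random() < sigma else 0
    return z


# A signal of length 7 at the first tenth of a string of 50 characters.
z = implant_signal(50, 7, first_site=50 // 10, rng=random.Random(0))
print(len(z), z[4:11])
```

Passing an explicit `random.Random` instance keeps each simulated string reproducible, which helps when the same strings are to be scored under several choices of $(p_0,p_1)$.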

Use of $(\alpha,\beta)$ curves is made to present the performance of the
proposed procedure when applied with parameters $p_0$ = .10, .15, .20 and
$p_1$ = .80, .90, .95 on the simulation data. The $(\alpha,\beta)$ curves for
the detection of signals of length S implanted in noisy strings of length
L-S are curves passing through the points $(\alpha_i,\beta_i)$ and
correspond to the choice of several critical thresholds $C_i$. Criteria
were chosen to be the midpoints between the values attained by the maximum
GLLR $M(p_0,p_1)$ for the 100 noisy strings and the 50 strings where
signals were implanted. $\alpha_i$ is the estimated probability of a 'false
alarm' when the procedure with parameters $p_0$ and $p_1$ and critical
threshold $C_i$ is applied to noisy data of length L. $\beta_i$ is the
estimated probability of no detection when the same procedure (with
parameters $p_0$, $p_1$ and $C_i$) is applied to noisy strings within which
signals of length S have been implanted, as explained. $\alpha_i$ depends on
$\pi$, the test parameters $C_i$, $p_0$ and $p_1$ and the total string
length L. In addition to these parameters, $\beta_i$ depends on $\sigma$
($\sigma$ = .90 in this study), signal length S and the position in which
the signal is implanted within the overall string. This dependence of
$\alpha_i$ and $\beta_i$ is not explicitly denoted to avoid making
expressions cumbersome.
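The construction of the points $(\alpha_i,\beta_i)$ can be sketched as follows. The function name `alpha_beta_points` is illustrative, and the choice of thresholds (midpoints between consecutive distinct values of the pooled maxima) is one interpretation of the rule stated above, not necessarily the exact rule used in the report.

```python
def alpha_beta_points(noise_maxima, signal_maxima):
    """Return a list of (C, alpha, beta) triplets.

    noise_maxima:  values of M(p0, p1) for the pure-noise strings.
    signal_maxima: values of M(p0, p1) for strings with an implanted signal.
    """
    values = sorted(set(noise_maxima) | set(signal_maxima))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    points = []
    for c in thresholds:
        # False alarm: a pure-noise string exceeds the threshold.
        alpha = sum(m > c for m in noise_maxima) / len(noise_maxima)
        # No detection: a string containing a signal does not exceed it.
        beta = sum(m <= c for m in signal_maxima) / len(signal_maxima)
        points.append((c, alpha, beta))
    return points


# Toy maxima, chosen only to show the mechanics.
for point in alpha_beta_points([1.0, 2.0, 2.0, 5.0], [4.0, 6.0]):
    print(point)
```

Raising the threshold can only lower the false alarm estimate and raise the no-detection estimate, so the points trace a monotone curve in the unit square.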

Figure 5-1a presents the nine $(\alpha,\beta)$ curves corresponding to the
choices $p_0$ = .10, .15, .20 and $p_1$ = .80, .90, .95 for the detection of
a signal of length 5 implanted at the first tenth of the noisy string of
45 characters. Figures 5-1b and 5-1c plot the same curves for signals of
length 7 and 9 implanted at the first tenth of noise, the overall
strings being 50 characters long. Figure 5-1d plots all 27 curves in the
same frame. Figures 5-2a to 5-2d present the same plots for signals of
length 5, 7 and 9 implanted in noisy strings at the first tenth of noise,
overall strings being 300 characters long. Figures 5-3 and 5-4 plot the
corresponding curves for signals of length 5, 7 and 9 implanted at the
middle of noise, overall strings being 50 and 300 characters long.

The proposed procedure is rather powerless in detecting signals of
length 5 implanted in noisy strings of 295 characters. When the signal is
implanted at the first tenth of the total string length, for all nine
values of $(p_0,p_1)$, $p_0$ = .10, .15, .20 and $p_1$ = .80, .90, .95, there
is no critical threshold value for which the two estimated probabilities
of error $\alpha_i$ and $\beta_i$ are both less than 15%. This is illustrated
in figure 5-2a; lying on the unit square, the $(\alpha,\beta)$ curves do not
cross the square [0, .15]x[0, .15]. Figure 5-4a illustrates that when a
signal of 5 characters is implanted at the middle of the overall string of
300 characters, for no critical thresholds are the estimates of the
probabilities of the two kinds of error both less than 20%, because it is
very likely that in a noisy string of 300 characters and success
probability $\pi$ there will exist a string of no less than four successes.
Neither is the procedure particularly powerful in detecting a signal of
five characters implanted within a noisy string of 45 characters. When the
signal is implanted at the first tenth of the overall string, there are
criteria for which $\alpha_i$ and $\beta_i$ are both less than 15%, but not
less than 10%. When the signal is implanted at the middle of the overall
string length, for C=6.98, $p_0$=.10 and $p_1$=.80, $\alpha$=.05 and
$\beta$=.08.

The curves in figures 5-1c, 5-2c, 5-3c and 5-4c indicate that the
proposed procedure is quite powerful in detecting signals of length 9
implanted in noisy strings of length 45 and 295. Since the scales in which
the $(\alpha,\beta)$ curves are drawn do not allow the estimates of the
probabilities of the two kinds of errors to be read, test parameters
(critical threshold C, $p_0$ and $p_1$) for which the estimated
probabilities of the two kinds of errors are small are presented in
tables 5-9 to 5-12.




Table 5-9. Critical values and estimates of the probabilities of the two
kinds of errors when detecting a signal of length 9 implanted at the
first tenth of a noisy string of 41 characters by M(p_0,p_1). π=.15, σ=.90

                p_1 = .80             p_1 = .90             p_1 = .95
 p_0          C     α     β         C     α     β         C     α     β

 .10        6.98   .05   .0       7.85   .04   .0       7.32   .04   .0
            7.89   .04   .0       9.00   .03   .02      8.05   .04   .02
            9.19   .03   .0       10.1   .03   .04      8.79   .03   .02
            9.89   .03   .02      11.3   .02   .04      9.59   .03   .04
            10.9   .03   .04      12.6   .0    .06      10.7   .02   .04
            12.6   .00   .06      14.6   .0    .08      12.6   .0    .06

 .15        6.31   .04   .0       6.08   .04   .0       6.05   .04   .02
            7.05   .03   .0       6.65   .04   .02      6.99   .03   .02
            7.37   .03   .02      7.20   .03   .04      8.54   .02   .04
            8.20   .03   .04      7.93   .03   .04      10.4   .0    .06
            9.16   .02   .04      8.81   .02   .04
            10.4   .0    .06      10.4   .0    .06

 .20        5.19   .04   .0       4.85   .04   .0       4.93   .04   .02
            5.56   .04   .02      5.15   .04   .02      5.73   .03   .02
            6.01   .03   .02      5.94   .03   .02      7.24   .02   .04
            6.69   .03   .04      7.24   .02   .04      8.85   .0    .06
            7.49   .02   .04      8.85   .0    .08
            8.85   .0    .0


Table 5-10. Critical thresholds and estimates of the probabilities of the
two kinds of errors when detecting a signal of length 9 implanted at the
first tenth of a noisy string of 291 characters by M(p_0,p_1). π=.15.

                p_1 = .80             p_1 = .90             p_1 = .95
 p_0          C     α     β         C     α     β         C     α     β

 .10        9.27   .06   .02      10.1   .02   .04      9.59   .02   .04
            9.73   .03   .02      11.2   .0    .14      10.3   .01   .14
            10.5   .02   .04      11.5   .0    .18      11.3   .01   .04

 .15        7.04   .17   .02      7.92   .02   .04      7.42   .13   .06
            7.26   .15   .02      8.43   .01   .14      7.91   .01   .14
            7.48   .14   .04      9.05   .0    .14      8.86   .0    .14
            8.13   .02   .04
            8.74   .01   .04

 .20        6.00   .14   .04      5.90   .14   .04      5.19   .13   .06
            6.68   .02   .04      6.40   .13   .06      6.51   .01   .14
            6.97   .01   .14      6.69   .01   .14      7.73   .0    .14




Since for all nine choices of $(p_0,p_1)$ there are thresholds for
which the probabilities of the two types of error are both less than .05,
the proposed procedure is illustrated to be quite powerful and robust in
detecting a signal of length 9 implanted in the middle of noisy strings
as long as 291 characters. The procedure is weaker when the signal is
implanted at the first tenth of the noisy string.

When a signal of 7 characters is implanted at the first tenth of
a noisy string of 293 characters, for no values of test parameters C,
$p_0$ and $p_1$ are the probabilities of both types of error less than 10%.
When the signal is implanted in the middle of the noisy string (of 293
characters), it is only for $p_0$=.10 and $p_1$=.80 that both probabilities
can become less than 10%. In particular, for

     C=9.40, $\alpha$=.05 and $\beta$=.02

and  C=9.73, $\alpha$=.03 and $\beta$=.08.

The procedure is more powerful in detecting a signal of length 7 within
noise 43 characters long. Test parameters for which the two types of
error are less than 10% are given in table 5-13. Each cell of table
5-13, considered as a three-way table, comprises two triplets for C,
$\alpha$ and $\beta$: the top for signals implanted at the first tenth of the
noisy string and the bottom for signals implanted at the middle.




Table 5-13. Critical thresholds and estimates of the probabilities of the
two kinds of errors when detecting a signal of length 7 implanted at one
tenth (above) and the middle (below) of a noisy string of 43 characters
by M(p_0,p_1). π=.15, σ=.90.

                        p_1 = .80            p_1 = .90            p_1 = .95
 p_0   position       C     α     β        C     α     β        C     α     β

 .10   tenth        6.98   .05   .0      7.85   .04   .04     7.64   .04   .04
       middle       6.98   .05   .0      7.85   .04   .06     7.64   .04   .06

       tenth        7.22   .04   .0      9.00   .03   .06     8.79   .03   .06
       middle       7.22   .04   .0      9.00   .03   .08     9.18   .03   .08

       tenth        8.15   .04   .04
       middle       7.68   .04   .04

       tenth        9.06   .03   .04
       middle       8.43   .04   .06

       tenth        9.34   .03   .08
       middle       9.19   .03   .06

 .15   tenth        6.31   .04   .04     6.26   .04   .04     6.05   .04   .04
       middle       5.69   .04   .04     6.26   .04   .06     6.05   .04   .06

       tenth        7.05   .03   .04     7.20   .03   .06     6.99   .03   .06
       middle       6.32   .04   .06     7.54   .03   .08     6.82   .03   .08

       tenth        7.37   .03   .06
       middle       7.05   .03   .06

 .20   tenth        5.20   .04   .04     5.14   .04   .04     4.93   .04   .04
       middle       5.20   .04   .06     5.14   .04   .06     4.93   .04   .06

       tenth        6.01   .03   .06     5.94   .03   .06     5.73   .03   .06
       middle       6.25   .03   .08     5.90   .03   .08     5.91   .03   .08


The two errors considered thus far were "false alarms" and no detection
when a signal is present. It is possible, however, that the procedure
detects a signal but detection is not accurate. Detection is perfectly
accurate when the substring maximizing the GLLR is identical to the
implanted signal. However, given that errors are allowed within signals,
perfectly accurate detection is overly restrictive; in analyzing the
simulation data for accurate detection, allowance has to be made for
moderate deviations between the two substrings. These deviations are
measured in an ad hoc fashion by the sum of the distances between the
beginning and endpoints of the two substrings. Formally, if the implanted
signal within $Z_1, Z_2, \ldots, Z_N$ is $Z_A, Z_{A+1}, \ldots, Z_B$ and
the substring maximizing $M(p_0,p_1)$ is $Z_i, Z_{i+1}, \ldots, Z_j$, the
deviation between the two substrings is taken to be
$D = |i-A| + |j-B|$. The detection of the implanted signal is considered
accurate if the sum is not larger than the smallest integer larger than
half the length of the implanted signal. In particular, signals of length
5, 7, 9, 11, 13 and 15 are considered to be accurately detected if the
sum is not larger than 3, 4, 5, 6, 7 or 8, respectively. Since the
performance of the proposed procedure in detecting a signal of length 5
is not satisfactory, its performance in detecting accurately will be
examined only for signals of length 7, 9 and 11.
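The accuracy criterion can be written out as a small sketch. The function names below are illustrative and not from the report; positions are taken to be integer indices of the first and last characters of each substring.

```python
import math

def deviation(i, j, a, b):
    """D = |i - A| + |j - B| for a detected substring [i, j]
    and an implanted signal [a, b]."""
    return abs(i - a) + abs(j - b)

def accuracy_bound(signal_length):
    """Smallest integer strictly larger than half the signal length."""
    return math.floor(signal_length / 2) + 1

def is_accurate(i, j, a, b):
    """Detection is accurate when D does not exceed the bound."""
    return deviation(i, j, a, b) <= accuracy_bound(b - a + 1)

# Bounds for the signal lengths mentioned in the text.
print([accuracy_bound(n) for n in (5, 7, 9, 11, 13, 15)])  # [3, 4, 5, 6, 7, 8]

# A signal implanted at sites 5..11 (length 7) and detected at 6..12: D = 2.
print(is_accurate(6, 12, 5, 11))  # True
```

Note that the bound depends only on the length of the implanted signal, so a detected substring that is much longer or shorter than the signal can still fail the test even when it overlaps the signal.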

Figures 5-5a, 5-5b and 5-5c plot the $(\alpha,\beta)$ curves for accurate
detection of signals of length 7, 9 and 11 implanted at the first tenth
of noisy strings, the overall string length being 50. Nine curves are
plotted on each frame, corresponding to $(p_0,p_1)$ for the choices
$p_0 = .10, .15, .20$ and $p_1 = .80, .90, .95$. Figure 5-5d superimposes
all 27 curves on the same frame. Figures 5-6 plot the same curves for
signals of length 7, 9 and 11 implanted in noise, the overall string
length being 300.
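One way such a curve could be estimated from simulation output is sketched below. It assumes that $\alpha$ is estimated from signal-free strings and $\beta$ from strings with an implanted signal, and that a run counts as a success only when the maximum statistic exceeds the threshold and the maximizing substring meets the accuracy criterion. The function name and the input layout are hypothetical.

```python
def error_curve(noise_maxima, signal_runs, thresholds):
    """Estimate (C, alpha, beta) for each critical threshold C.

    noise_maxima : max statistic from each simulated signal-free string
    signal_runs  : (max statistic, accurate) pairs from each simulated
                   string with an implanted signal; `accurate` says whether
                   the maximizing substring met the accuracy criterion
    """
    curve = []
    for c in thresholds:
        # alpha: share of signal-free runs that exceed the threshold
        alpha = sum(m > c for m in noise_maxima) / len(noise_maxima)
        # beta: share of signal runs with no detection or inaccurate detection
        hits = sum(1 for m, accurate in signal_runs if m > c and accurate)
        beta = 1 - hits / len(signal_runs)
        curve.append((c, alpha, beta))
    return curve

# Made-up numbers for illustration only; they are not simulation results.
noise = [3.1, 4.0, 5.2, 6.5]
signal = [(9.0, True), (7.5, True), (6.0, False), (4.5, True)]
for c, alpha, beta in error_curve(noise, signal, [4.0, 6.0]):
    print(c, alpha, beta)
```

Under this definition a run whose maximizing substring is inaccurate counts against $\beta$ at every threshold, so $\beta$ has a floor that lowering the threshold cannot remove.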

On some of the plots in figures 5-5 and 5-6, the probability of
accurate detection cannot be made larger than 98%, i.e. $\beta$ cannot be
made less than 2%, no matter how large $\alpha$ is, i.e. even for very
small critical thresholds. This is so because a relatively large number of
mismatches in the implanted signal may cause a substring of the noisy
string to maximize the GLLR in the overall string. Figure 5-7 lists the
substrings maximizing the GLLR and the maximum GLLR $M(p_0,p_1)$ attained
for the nine choices of $(p_0,p_1)$ when the procedure is applied to
detect a signal implanted at sites 5 to 11 and the overall string length
is 50 characters. With one exception marked on the figure, in all 50 runs
the substrings maximizing the GLLR are close to the implanted signal.

Since the scales on which figures 5-5 and 5-6 are drawn do not
allow the probabilities of "false alarms" and of no detection or
non-accurate detection to be read off, criteria for which the estimated
probabilities are small are listed in tables. Tables 5-14 and 5-15 list
criteria for which the estimated probabilities of the two kinds of errors
are small when the procedure is applied to detect accurately signals of
length 7 and 9 implanted at the first tenth of noisy strings of length 43
and 41.

Table 5-14. Critical thresholds and estimates of the probabilities of the
two kinds of errors when detecting accurately a signal of length 7
implanted at the first tenth of a noisy string of 43 characters by
M(p0,p1).

               p1=.80             p1=.90             p1=.95
            C     α     β      C     α     β      C     α     β

p0=.10    6.98   .05   .08   7.85   .04   .10   7.64   .04   .08
          7.22   .04   .08   9.00   .03   .12   8.79   .03   .16

p0=.15    7.05   .03   .10   6.26   .04   .08   6.05   .04   .06
          7.37   .03   .12   7.20   .03   .10   6.99   .03   .08

p0=.20    5.20   .04   .06   5.14   .04   .06   4.93   .04   .06
          6.01   .03   .08   5.94   .03   .08   5.73   .03   .08

Table 5-15. Critical thresholds and estimates of the probabilities of the
two kinds of errors when detecting accurately a signal of length 9
implanted at the first tenth of a noisy string of 41 characters by
M(p0,p1).

               p1=.80             p1=.90             p1=.95
            C     α     β      C     α     β      C     α     β

p0=.10    7.98   .04   .02   10.1   .03   .04   8.79   .03   .02
          9.19   .03   .02   11.3   .02   .04   10.7   .02   .04

p0=.15    7.37   .03   .02   7.20   .03   .02   6.99   .03   .02
          9.16   .02   .04   8.83   .02   .04   8.54   .02   .04

p0=.20    6.01   .03   .02   5.94   .03   .02   5.73   .03   .02
          7.49   .02   .04   7.24   .02   .04   7.24   .02   .04



The procedure is quite powerful and robust; accurate detection is
accomplished with errors of small probability (less than 5%) when the
procedure parameters are selected to be $p_0 = .10, .15, .20$ and
$p_1 = .80, .90, .95$ while the probabilities of success in noise and
signal are .15 and .90. Since the performance of the proposed procedure in
detecting a signal of length 7 implanted within a noisy string of length
293 was poor, the operating characteristics of the procedure are given for
the accurate detection of signals of 9 and 11 characters only.

Table 5-16. Critical thresholds and estimates of the probabilities of the
two kinds of errors when detecting accurately a signal of length 9
implanted at the first tenth of a noisy string of 291 characters by
M(p0,p1).

               p1=.80             p1=.90             p1=.95
            C     α     β      C     α     β      C     α     β

p0=.10

9.74 

.03 

.06 

10.1 

.02 

.04 

9.59 

.02 

.04 

10.6 

.02 

.08 

p0=.15 

8.75 

.01 

.04 

7.93 

.02 

.04 

7.42 

7.91 

.13 

.01 

.06 

.14 

p0=.20 

6.69 

.02 

.04 

6.40 

.13 

.06 

5.90 

.13 

.06 

6.69

.01 

.14 

6.51 

.01 

.14 

Table 5-17. Critical thresholds and estimates of the probabilities of the
two kinds of errors when detecting accurately a signal of length 11
implanted at the first tenth of a noisy string of 289 characters by
M(p0,p1).

               p1=.80             p1=.90             p1=.95
            C     α     β      C     α     β      C     α     β

p0=.10

9.74 

.03 

.04 

10.1 

.02 

.0 

9.27 

.02 

.02 

10.6 

.02 

.06 

12.1 

.0 

.02 

p0=.15

8.75 

.01 

.02 

3.44 

.01 

.04 

7.91 

.01 

.06 

p0=.20 

6.97 

.01 

.02 

6.69 

.01 

.06 

6.51 

.01 

.06 

The probabilities of the two kinds of errors listed in tables 5-16 and
5-10 for accurate detection and detection are not substantially different.
The procedure is quite powerful in accurately detecting a signal of
length 9 within a string of overall length 300 and, as expected, more
powerful and remarkably robust in accurately detecting a signal of length
11.

The two-dimensional problem of detecting a signal within two words X
and Y is now briefly addressed. In examining character matrices
visually, the investigator scans each diagonal for substrings with a few
occasional mismatches. In an analogous fashion, the procedure for the
two-dimensional problem transforms blank and nonblank characters to 0's
and 1's and computes the maximum GLLR along each diagonal for selected
values of $p_0$ and $p_1$. Let the maximum GLLR along the matrix diagonal
of lag k be denoted by $M(p_0,p_1,k)$. If $M(p_0,p_1,k)$ is larger than
some critical value, the substrings of X and Y along the diagonal at lag k
for which $M(p_0,p_1,k)$ is attained are considered to be realizations of
a common signal in the data.
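The diagonal scan can be sketched as follows. Several things here are assumptions rather than statements from the report: a nonblank character is taken to mean that the two residues are identical, the diagonal of lag k pairs `x[i]` with `y[i + k]`, and the substring score is taken to be the logarithm of the supremum over $p \ge p_1$ that appears in Appendix 1 (equation (5-6) is not reproduced in this section). With natural logarithms this score gives 8.05, 9.66 and 8.45 for five matches, six matches, and seven matches with one mismatch, which agrees with values quoted below for $p_0 = .20$, $p_1 = .90$.

```python
import math

def gllr(s1, s0, p0, p1):
    """Score of a substring with s1 successes and s0 failures (assumed form)."""
    p = max(p1, s1 / (s1 + s0))          # constrained estimate, p >= p1
    value = s1 * math.log(p / p0)
    if s0:                               # skip the 0 * log(0) term when p = 1
        value += s0 * math.log((1 - p) / (1 - p0))
    return value

def max_gllr(bits, p0, p1):
    """Maximum score over substrings bits[i..j] with i < j: (value, i, j)."""
    best = (-math.inf, None, None)
    for i in range(len(bits)):
        s1 = bits[i]
        for j in range(i + 1, len(bits)):
            s1 += bits[j]
            s0 = (j - i + 1) - s1
            value = gllr(s1, s0, p0, p1)
            if value > best[0]:
                best = (value, i, j)
    return best

def diagonal_bits(x, y, k):
    """0/1 string along the diagonal of lag k (x[i] against y[i + k])."""
    start, stop = max(0, -k), min(len(x), len(y) - k)
    return [int(x[i] == y[i + k]) for i in range(start, stop)]

def scan_diagonals(x, y, p0, p1):
    """Map each lag k to the maximum score on that diagonal."""
    return {k: max_gllr(diagonal_bits(x, y, k), p0, p1)[0]
            for k in range(-(len(x) - 1), len(y))}

print(round(gllr(5, 0, 0.20, 0.90), 2))  # 8.05
print(round(gllr(7, 1, 0.20, 0.90), 2))  # 8.45

# The lag-0 pair of substrings listed in table 5-18.
scores = scan_diagonals("MSTFAFLFLCIQACL", "MSTFAFLLLCAQACL", 0.20, 0.90)
print(max(scores, key=scores.get))  # 0
```

The double loop is quadratic in the diagonal length, which is adequate for words of a few hundred residues; a faster scan would be a separate design choice.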

Apart from the nominal parameters $p_0$ and $p_1$, the critical threshold
should depend on the amino acid counts in X and Y; it is chosen to be the
estimated .95 quantile of the distribution of $\max_k M(p_0,p_1,k)$ for
random permutations of the words.
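A minimal sketch of the permutation threshold follows. It assumes that only one of the two words is permuted, which keeps the amino acid counts of both words fixed. The statistic is passed in as a function; `longest_diagonal_run` is a hypothetical stand-in used only to keep the example self-contained, and in practice the maximum GLLR over all diagonals would take its place.

```python
import math
import random

def longest_diagonal_run(x, y):
    """Stand-in statistic: longest run of identical characters on any diagonal."""
    best = 0
    for k in range(-(len(x) - 1), len(y)):
        run = 0
        for i in range(max(0, -k), min(len(x), len(y) - k)):
            run = run + 1 if x[i] == y[i + k] else 0
            best = max(best, run)
    return best

def permutation_threshold(x, y, statistic, n_perm=100, q=0.95, seed=0):
    """Empirical q-quantile of statistic(x, permuted y) over n_perm shuffles."""
    rng = random.Random(seed)
    letters = list(y)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(letters)             # permuting preserves letter counts
        maxima.append(statistic(x, "".join(letters)))
    maxima.sort()
    index = max(0, math.ceil(q * n_perm) - 1)
    return maxima[index], maxima

threshold, maxima = permutation_threshold(
    "GGLGYEGLGY", "GGLGYGGLGY", longest_diagonal_run, n_perm=100, seed=1)
print(threshold, sum(m > threshold for m in maxima))
```

Because detection requires the statistic to be strictly larger than the critical value, at most 5 of the 100 permutation maxima can exceed the threshold returned here.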

The proposed procedure has been applied to chorion proteins 292 and
18B for $p_0 = .20$ and $p_1 = .90$. Figure 5-8 plots $M(p_0,p_1,k)$ at
each matrix diagonal. For 100 permutations of protein 18B the 29 largest
values of $\max_k M(p_0,p_1,k)$ are: 11.3, 9.66(3), 8.45, 8.05(24).
(Numbers in parentheses denote ties.) Hence, with 8.2 as a critical value
the procedure detects nine signals in nine matrix diagonals. The
realizations of the signals in the data are listed in decreasing order of
$M(p_0,p_1,k)$ in table 5-18.
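The choice of 8.2 can be checked against the reported permutation values with a few lines of arithmetic. Only the 29 largest of the 100 values are given in the text; the remaining 71 are no larger than 8.05 and so cannot exceed the critical value.

```python
# The 29 largest permutation maxima as reported, with ties expanded.
reported = [11.3] + [9.66] * 3 + [8.45] + [8.05] * 24
n_perm = 100
critical = 8.2

exceed = sum(v > critical for v in reported)
print(len(reported), exceed, exceed / n_perm)  # 29 5 0.05
```

Any critical value between 8.05 and 8.45 leaves the same five permutation maxima above it, which corresponds to the estimated .95 quantile.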


Table 5-18. Realizations of signals detected in proteins 292 and 18B.

 LAG

 -12   YGGEGIGNVAVAGELPVAGTTAVAGQVPIIGAVDFCGHANAGGCVSIGGRCTGCGCGCG
       YGGTGIGNVAVAGELPVAGKTAVGGQVPIIGAVGFGGTAGAAGCVSIAGRCGGCGCGCG

   0   MSTFAFLFLCIQACL
       MSTFAFLLLCAQACL

  -3   GGLGYEGLGYGALGY
       GGLGYGGLGYGGLGY

 -10   GYEGLGYGALGYDGLGY
       GYGGLGYGGLGYGGLGY

 -15   GYGALGYDGLGYG
       GYGGLGYGGLGYG

   3   GGLGYEG
       GGLGYEG

 -11   CGCGGLG
       CGCGGLG

-100   GCGCGCG
       GCGCGCG

 -20   GYDGLGYG
       GYGGLGYG


Figure 5-1. (α,β) curves for detection by M(p0,p1) of a signal implanted
within noise at first tenth of noisy string. Overall string length L=50.
p0=.10, .15, .20, and p1=.80, .90, .95. π=.10, σ=.90. (a) Signal 5
characters long, (b) Signal 7 characters long, (c) Signal 9 characters
long, (d) superimposes plots of a, b and c on one frame. Horizontal axis:
ALPHA.


Figure 5-2. (α,β) curves for detection by M(p0,p1) of a signal implanted
within noise at first tenth of noisy string. Overall string length L=300.
p0=.10, .15, .20, and p1=.80, .90, .95. π=.10, σ=.90. (a) Signal 5
characters long, (b) Signal 7 characters long, (c) Signal 9 characters
long, (d) superimposes plots of a, b and c on one frame. Horizontal axis:
ALPHA.


Figure 5-3. (α,β) curves for detection by M(p0,p1) of a signal implanted
within noise at the middle of noisy string. Overall string length L=50.
p0=.10, .15, .20, and p1=.80, .90, .95. π=.10, σ=.90. (a) Signal 5
characters long, (b) Signal 7 characters long, (c) Signal 9 characters
long, (d) superimposes plots of a, b and c on one frame.


Figure 5-4. (α,β) curves for detection by M(p0,p1) of a signal implanted
within noise at the middle of noisy string. Overall string length L=300.
p0=.10, .15, .20, and p1=.80, .90, .95. π=.10, σ=.90. (a) Signal 5
characters long, (b) Signal 7 characters long, (c) Signal 9 characters
long, (d) superimposes plots of a, b and c on one frame. Horizontal axis:
ALPHA.


Figure 5-5. (α,β) curves for accurate detection by M(p0,p1) of a signal
implanted within noise at first tenth of noisy string. Overall string
length L=50. p0=.10, .15, .20, and p1=.80, .90, .95. π=.10, σ=.90.
(a) Signal 7 characters long, (b) Signal 9 characters long, (c) Signal 11
characters long, (d) superimposes plots of a, b and c on one frame.
Horizontal axis: ALPHA.




substring  of  43  characters. 




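APPENDIX 1.


This appendix derives the GLLR test statistic for the hypothesis
that within the string of independent binary variables
$Z_1, Z_2, \ldots, Z_N$ there exists a substring
$Z_i, Z_{i+1}, \ldots, Z_j$ such that the probability of success
within the substring is larger than that outside it. The GLLR test
statistic above is related to the test statistic of equation (5-7).

Let

\[ Z_1, Z_2, \ldots, Z_{i-1}, Z_i, \ldots, Z_j, Z_{j+1}, \ldots, Z_N \tag{A-1} \]

be a string of independent binary variables. We shall refer to 1's and
0's as successes and failures. Suppose that the success probability for
the substring

\[ Z_1, Z_2, \ldots, Z_{i-1}, Z_{j+1}, \ldots, Z_N \tag{A-2} \]

is $p_0$ and that for

\[ Z_i, Z_{i+1}, \ldots, Z_j \tag{A-3} \]

is $p$.

Let $0 < p_0 < p_1 < 1$. We are interested in testing the hypothesis

\[ H_0: p = p_0 \quad \text{vs.} \quad H_A: p \ge p_1. \tag{A-4} \]

If $S_1$ and $S_0$ are the numbers of successes and failures for the
substring in (A-3) and $T_1$ and $T_0$ are the numbers of successes and
failures in the substring in (A-2), under $H_A$,

\[
\Pr(T_0=t_0, T_1=t_1, S_0=s_0, S_1=s_1)
  = \binom{N-(j-i+1)}{t_1} \binom{j-i+1}{s_1}\,
    p_0^{t_1} (1-p_0)^{t_0}\, p^{s_1} (1-p)^{s_0}
\]

and under $H_0$,

\[
\Pr(T_0=t_0, T_1=t_1, S_0=s_0, S_1=s_1)
  = \binom{N}{s_1+t_1}\, p_0^{s_1+t_1} (1-p_0)^{s_0+t_0}.
\]

Hence the GLR for the hypothesis (A-4) is:

\[
\sup_{i<j}\, \sup_{p \ge p_1}
\frac{\binom{N-(j-i+1)}{t_1} \binom{j-i+1}{s_1}\,
      p^{s_1} (1-p)^{s_0}\, p_0^{t_1} (1-p_0)^{t_0}}
     {\binom{N}{s_1+t_1}\, p_0^{s_1+t_1} (1-p_0)^{s_0+t_0}}
\]

\[
= \sup_{i<j}\, \sup_{p \ge p_1}
\frac{\binom{N-(j-i+1)}{t_1} \binom{j-i+1}{s_1}}{\binom{N}{s_1+t_1}}
\left( \frac{p}{p_0} \right)^{s_1}
\left( \frac{1-p}{1-p_0} \right)^{s_0}
\]

\[
= \sup_{i<j}\,
\frac{\binom{N-(j-i+1)}{t_1} \binom{j-i+1}{s_1}}{\binom{N}{s_1+t_1}}\,
\sup_{p \ge p_1}
\left( \frac{p}{p_0} \right)^{s_1}
\left( \frac{1-p}{1-p_0} \right)^{s_0}.
\]

Therefore the GLLR test statistic for the hypothesis that for some
substring of $Z_1, \ldots, Z_N$ the success probability is not smaller
than $p_1$ is

\[
\max_{i<j} \left\{
\log \frac{\binom{N-(j-i+1)}{t_1} \binom{j-i+1}{s_1}}{\binom{N}{s_1+t_1}}
+ L_{ij}(p_0,p_1) \right\}
\]

for $L_{ij}(p_0,p_1)$ defined in equation (5-6). $M(p_0,p_1)$, the test
statistic of chapter 5, neglects the first term and equals

\[ \max_{i<j} L_{ij}(p_0,p_1). \]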
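The inner supremum over $p \ge p_1$ can be evaluated in closed form. The sketch below assumes that $L_{ij}(p_0,p_1)$ of equation (5-6), which is not reproduced in this appendix, is the logarithm of that supremum; the symbol $\hat p$ is notation introduced here, and the convention $0 \log 0 = 0$ is used when $s_0 = 0$.

```latex
% The log of the ratio is concave in p, with unconstrained maximum at
% s_1/(s_1+s_0); under the constraint p >= p_1 the maximum moves to the
% boundary p_1 whenever s_1/(s_1+s_0) < p_1.
\[
\hat p = \max\!\left( p_1,\ \frac{s_1}{s_1+s_0} \right),
\qquad
\log \sup_{p \ge p_1}
\left( \frac{p}{p_0} \right)^{s_1}
\left( \frac{1-p}{1-p_0} \right)^{s_0}
= s_1 \log \frac{\hat p}{p_0} + s_0 \log \frac{1-\hat p}{1-p_0}.
\]
% Worked values for p_0 = .20, p_1 = .90 (natural logarithms):
\[
s_1 = 5,\ s_0 = 0:\quad \hat p = 1,\quad 5 \log 5 \approx 8.05;
\qquad
s_1 = 7,\ s_0 = 1:\quad \hat p = .90,\quad
7 \log 4.5 + \log 0.125 \approx 8.45.
\]
```

Both worked values appear among the permutation maxima reported in chapter 5 for $p_0 = .20$ and $p_1 = .90$, which is consistent with this reading of $L_{ij}$.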




BIBLIOGRAPHY 

1. Bickel P.J. and Doksum K.A. (1977). Mathematical Statistics.
Holden-Day, San Francisco.

2. Chernoff H. (1954). "On the Distribution of the Likelihood Ratio."
Ann. Math. Stat. 25, 573-578.

3. Chung K.L. (1974). A Course in Probability Theory. 2nd edition.
Academic Press, New York.

4. Chvatal V. and Sankoff D. (1975). "Longest Common Subsequences of Two
Random Sequences." J. Appl. Prob. 12, 306-315.

5. Deken J.G. (1976). "On Records: Scheduled Maxima Sequences and Longest
Common Subsequence." Technical report No. 91, Department of
Statistics, Stanford University.

6. Dayhoff M.O. (1972). Atlas of Protein Sequence and Structure.
National Biomedical Research Foundation, Washington D.C.

7. Gibbs A.J. and McIntyre G.A. (1970). "The Diagram, a Method of
Comparing Sequences. Its Use with Amino Acid and Nucleotide
Sequences." Eur. J. Biochemistry 16, 1-11.

8. Hood L.E., Wilson J.H. and Wood W.B. (1975). Molecular Biology of
Eucaryotic Cells. Benjamin/Cummings, Menlo Park, Calif.

9. Lehmann E.L. (1975). Nonparametrics. Holden-Day, San Francisco,
Calif.

10. Mahan B.H. (1969). University Chemistry. Addison-Wesley, Reading,
Mass.

11. Needleman S.B. and Wunsch C.D. (1970). "A General Method Applicable to
the Search for Similarities in the Amino Acid Sequence of Two
Proteins." J. Mol. Biol. 48, 443-453.

12. Rao C.R. (1973). Linear Statistical Inference and Its Applications.
2nd edition, Wiley, New York.

13. Sankoff D. (1972). "Matching Sequences under Deletion/Insertion
Constraints." Proc. Nat. Acad. Sci. USA 69, No. 1, 4-6.

14. Steele M.J. (1980). "Long Common Subsequences and the Proximity of
Two Random Strings." Technical report, Department of Statistics,
Stanford University.

15. Watson J.D. (1975). The Molecular Biology of the Gene. Benjamin,
Menlo Park, Calif.


UNCLASSIFIED

REPORT DOCUMENTATION PAGE

Title: Towards A Statistical Analysis of Genetic Sequences Data With
Particular Reference To Protein Sequences
Type of report: Technical Report
Author: Spyros P. Arsenis
Contract or grant number: N00014-75-C-0555
Controlling office: Office of Naval Research, Statistics and Probability
Code 436, Arlington, Virginia 22217
Report date: March 1985
Security class (of this report): Unclassified
Distribution statement: Approved for public release; distribution
unlimited
Key words: Genetic Sequences, DNA, Matrix Smear, Character Matrix Graphics
Abstract: See reverse side


ABSTRACT


This report develops a variety of character matrices as graphical tools
for the visual examination of genetic sequences and in particular protein
sequences. The NNC, PNC, BNC1, BNC2 and BNC3 matrices are designed to filter
noise without severely suppressing signals in the CC matrix. The Matrix Smear
of a character matrix is introduced as a measure of signals and noise in the
matrix. The asymptotic distribution of the smears of the CC and NNC matrices
are derived under the independence model. The asymptotic result is used in
conjunction with exact confidence intervals from diagonal smears to automate
partially the visual examination of character matrices. A generalized
likelihood ratio procedure is developed to automate fully the detection of
signals in two protein sequences. A simulation study has proven the procedure
to be powerful and robust in detecting signals of success probability .90 and
length 9 implanted within noisy binary strings of length 291 characters and
success probability .15.




